I. INTRODUCTION



The score obtained by an individual on an achievement test
will, depending on the standard employed, generally provide one
of two types of information. If information pertaining to an
individual's standing in reference to others in a particular
group is desired, a relative standard is employed. Glaser
referred to such a measure as a "norm-referenced measure."^1
Scores on norm-referenced measures are typically in the form
of percentiles, equivalent scores, standard scores, etc.

If information pertaining to an individual's level of
mastery of some specified criterion is desired, an absolute
standard is employed. Such a measure is referred to by Glaser
as a "criterion-referenced measure."^2 The same distinction has
been made by Ebel with "Normative Standard Scores" versus
"Content Standard Scores,"^3 and by Flanagan with "standard"
versus "norm" scores.^4 According to Glaser,

Measures which assess student achievement in
terms of a criterion standard thus provide
information as to the degree of competence attained by
a particular student which is independent of
reference to the performance of others.^5

^1 Robert Glaser, "Instructional Technology and the
Measurement of Learning Outcomes: Some Questions," American
Psychologist, XVIII (1963), 520.

^2 Ibid., p. 519.

^3 Robert L. Ebel, "Content Standard Test Scores,"
Educational and Psychological Measurement, XXII (1962), 15.

In many classroom situations the use of norm-referenced
measures has been emphasized. Typically, the instructional
sequence, materials, and rate of progress for each student in
the class are held constant. At the end of some specified unit
an achievement test is administered to the entire class at the
same time. The students' scores are then ranked in relation to
each other, or in some instances, the scores are ranked and
interpreted in reference to some normative group.

While providing information concerning the number of
right and wrong answers and the relative standings of
individuals, the score on a norm-referenced test does not indicate
what specific behaviors the student has mastered. Except in
the extreme cases where every item is passed or failed, raw
scores or percentages indicate only the number of questions
answered correctly. Converting the raw scores to percentiles,
standard scores, equivalent scores, etc., still provides no
information concerning the particular skills the student has
or has not mastered. "From a percentile we know the location
of a pupil's score in the distribution of scores of the
normative pupils, we still do not know how much arithmetic a pupil
understands."^6 Different scores indicate that different items
have been answered correctly, but not which items were answered
correctly. The same scores do not necessarily indicate that
the same items have been passed; success on many different
items has probably occurred. To determine the specific
behaviors which have been mastered, the individual items need to
be examined.

^4 John C. Flanagan, "Units, Scores, and Norms,"
Educational Measurement, E. F. Lindquist, editor (Washington,
D.C.: American Council on Education, 1951), p. 698.

^5 Glaser, op. cit., p. 520.
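The point that identical raw scores can conceal different item patterns can be illustrated with a minimal sketch. The two response records below are hypothetical and are not data from any study cited here; 1 denotes a passed item and 0 a failed item.

```python
# Hypothetical item-response records (1 = item passed, 0 = item failed).
responses = {
    "student_a": [1, 1, 1, 0, 0],
    "student_b": [0, 1, 0, 1, 1],
}


def raw_score(items):
    """Number of items answered correctly."""
    return sum(items)


def items_passed(items):
    """1-based positions of the items answered correctly."""
    return [position + 1 for position, x in enumerate(items) if x == 1]


for name, items in responses.items():
    print(name, raw_score(items), items_passed(items))
```

Both hypothetical students obtain the same raw score, yet only item 2 is common to their sets of passed items; the score alone does not reveal this.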

In many cases knowing the ranks of the individuals
is sufficient. But with the development of programs of
individualized instruction, such as programed learning or
nongraded classrooms, criterion-referenced measures have become
increasingly important. In individualized instruction each
student sets his own pace for learning and in the process may
pursue varied curriculum sequences and materials. The
performance criteria for a specified unit of work may be
identical for all students, however, their performance being compared
to an absolute rather than a relative standard. Minimum
levels of mastery are established which the student must meet
before progressing to the next unit. Tests for units are not
administered to the group as a whole, but to individuals as
they complete these units in their instructional sequence.
The score on this type of test is used to determine whether a
student progresses to the next unit. His score is compared
to the criterion established for the unit, not to the scores
of others in the group.

^6 Fred T. Tyler and Walter R. Stellwagen, "The Search
for Evidence about Individual Differences," Individualizing
Instruction, Sixty-first Yearbook of the National Society for
the Study of Education, Part I (Chicago, Illinois: The
University of Chicago Press, 1962), p. 99.
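The contrast between the two standards can be sketched in a few lines. The function names, the 85 percent mastery criterion, and the scores are all assumptions chosen for illustration; the text does not specify a particular criterion level or a particular percentile formula.

```python
def percentile_rank(score, group_scores):
    """Relative standard: percent of the reference group scoring below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def has_mastered(score, n_items, criterion=0.85):
    """Absolute standard: compare the proportion correct with a fixed criterion.

    The 0.85 default is a hypothetical minimum level of mastery.
    """
    return score / n_items >= criterion


# The same score of 16 on a 20-item unit test, placed in two reference groups.
print(percentile_rank(16, [10, 12, 14, 16, 18]))
print(percentile_rank(16, [16, 17, 18, 19, 20]))
# The criterion-referenced decision does not change with the group.
print(has_mastered(16, 20))
```

The percentile rank of the score shifts with the reference group, whereas the go/no-go decision depends only on the score and the criterion established for the unit.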

The "impact of individualized instruction on testing
and diagnosis in the schools" has been discussed by Coulson
and Cogswell.^7 They stated that the trend toward
individualized instruction "... is not an isolated phenomenon,
independent of other educational activities such as testing...."^8
The authors spoke of the need to develop "... a go/no go test
determining whether a student is ready to graduate or to
progress to the next study unit...."^9 If such diagnostic
techniques as Coulson and Cogswell described could be developed,
the authors suggested that "they should provide not only a
means for more effective instruction, but also a basis for
constructing more useful theories of education and learning."^10

^7 John E. Coulson and John F. Cogswell, "Effect of
Individualized Instruction on Testing," Journal of Educational
Measurement, II (1965), pp. 59-64.

^8 Ibid., p. 59.

^9 Ibid.

^10 Ibid., p. 63.




The need for further consideration in test
development has also been recommended by Glaser, who stated:

Test development has been dominated by the
particular requirements of predictive,
correlational aptitude test "theory." Achievement
and criterion measurement has attempted
frequently to cast itself in this framework.
However, many of us are beginning to recognize
that the problems of assessing existing levels
of competence and achievement and the conditions
that produce them require some additional
consideration.^11

Since criterion-referenced measures are directly
concerned with "assessing existing levels of competence and
achievement," they should provide information concerning the
students' successes and failures on specified behaviors.

Whether this information can be obtained from the raw
score on a criterion-referenced test, or whether, as in the
case of norm-referenced tests, the individual test items
require examination remains a problem.

II. REVIEW OF RELATED RESEARCH



A. Attempts to Obtain Test Scores Providing Further Information

Attempts have been made to provide scores yielding
further information than that furnished by norm-referenced
scores. Grossnickle employed Thurstone's paired comparisons
technique to investigate the possible scaling of individuals
making certain test scores.^12 The desire was to obtain scores
which would remain relatively the same for individuals
regardless of the group in which they were placed. A biology
test of 100 items was administered to 100 persons whose scores
were then ranked from highest to lowest. These scores were
then grouped by 5's to form twenty "hypothetical individuals,"
the top five scores being individual number 1, etc. Scaled
scores for these twenty individuals were obtained.

A new group of thirty persons was selected and given
the test. These persons were also combined into six
"individuals," and scaled scores obtained. Four "individuals"
from the original group were then randomly selected and
combined with the latter group of six. While the four scaled
scores did not remain the same in the new group as in the old,
the distance between the scores remained stable. Grossnickle
concluded that "this experiment using the paired comparison
method, has proved that it is possible to scale individuals
taking any mental and educational test."^13
Such a conclusion seems somewhat unwarranted. The author
claimed this truth for individuals, yet she never dealt with
individuals; also, she generalized to the population of all
mental and educational tests from one biology test. No
additional meaning could really be attached to the obtained
scores since they changed depending on the reference group.

^12 Louise T. Grossnickle, "The Scaling of Test Scores
by the Method of Paired Comparisons," Psychometrika, VII (1942),
pp. 43-64.

^13 Ibid., p. 62.

A further attempt to add meaning to test scores was
reported by Tucker at the 1952 Invitational Conference on
Testing Problems. According to Tucker:

... experimental and analytic methods for test
development and score scaling may exist or be
developed which do not depend on the relative
number of examinees who receive each particular
score in a reference group of examinees.^14

In keeping with this suggestion Tucker attempted to obtain
scores relating individuals' proficiency on a skill to the
difficulty of a task performed at a marginal degree of success.
He provided the following example: In receiving telegraph
code an individual will make fewer errors receiving slow signals
than fast signals. At some speed he would receive with 90%
accuracy. This signal speed could be used to characterize that
individual's level of proficiency. Tucker proposed a system
for defining a scale of difficulty for intellectual skills.^15
Such a model involved several steps which included
establishing subgroups of individuals with approximately equal
skills, obtaining the proportion of successes on each task for
each subgroup, and determining a scale value for each subgroup.
Tucker reported that results from an application to a set of
verbal analogy items indicated promising possibilities, but
he provided no data. He proposed a further tryout involving
the scaling of vocabulary items from fourth grade through
college. It would appear, however, that this technique will
provide a score similar to a mental age, rather than indicating
what specific behaviors have been mastered.

^14 Ledyard R. Tucker, "Selecting Appropriate Score
Scales for Tests--Scales Minimizing the Importance of Reference
Groups," Proceedings, 1952 Invitational Conference on Testing
Problems (Princeton: Educational Testing Service, 1953), pp. 27-28.

^15 Ledyard R. Tucker, "A Level of Proficiency Scale
for a Unidimensional Skill," American Psychologist, VII (1952),
408. (Abstract)
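Tucker's telegraph example can be made concrete with a small sketch. The speeds and accuracies below are invented for illustration, and linear interpolation between adjacent observations is an assumption of this sketch; it is not a procedure attributed to Tucker.

```python
def speed_at_accuracy(speeds, accuracies, target=0.90):
    """Estimate the signal speed at which accuracy falls to `target`.

    `speeds` must be ascending and `accuracies` non-increasing with speed.
    Linear interpolation between neighboring points is assumed.
    """
    points = list(zip(speeds, accuracies))
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        if a0 >= target >= a1:
            if a0 == a1:
                return s0
            return s0 + (a0 - target) * (s1 - s0) / (a0 - a1)
    raise ValueError("target accuracy is outside the observed range")


# Hypothetical record for one receiver: words per minute and proportion correct.
speeds = [10, 15, 20, 25]
accuracies = [0.99, 0.95, 0.85, 0.60]
level = speed_at_accuracy(speeds, accuracies)
print(round(level, 2))
```

The resulting speed characterizes the individual by the difficulty of the task he performs at a marginal degree of success, without reference to the scores of any other examinees.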

A similar attempt to provide scores indicative of a
level of proficiency has been reported by Ebel. He discussed
two studies concerned with providing "content standard scores."
These scores are based directly on the tasks which compose the
content of the test, and are defined as a "percent of a
systematic sample from a defined domain of tasks which an
individual has performed successfully."^16

Ebel constructed a test of knowledge of word meanings
based on a sample of 100 words from a specified dictionary.
The words were arranged in alphabetical order; the task was to
match the words with their corresponding definitions. Ebel
stated that "these tests constitute one operational definition
of the proportion of words in a certain dictionary for which a
person 'knows' the meaning...."^17

Ten items were also selected by Ebel from the
mathematics section of the 1959 Preliminary Scholastic Aptitude Test.
Initially all fifty items of the test were classified into ten
content categories: "Calculations with fractions, Verbal
problems, Percentage and statistics," etc. The discriminating
power of each item was found by subtracting the proportion of
correct responses in a low scoring group (PSAT scores below 300)
from the proportion of correct responses in a high scoring group
(PSAT scores above 700). The item in each category with the highest
discriminating power was chosen. These items were then scored
on six sets of 100 answer sheets which had PSAT scores near
750, 650, 550, 450, 350, and 250. The most frequent score on
the ten items was found for each group. For example, a score
of 9 was the most frequent for those scoring 750; a score of 3
was most frequent for those scoring 450. Therefore, a score on
the ten item test was taken as an indication of the score on
the PSAT.

^16 Ebel, "Content Standard Test Scores," Educational and
Psychological Measurement, XXII (1962), 15.

^17 Ibid., pp. 24-25.
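The item selection step just described can be sketched as follows. The item names, categories, and response records are hypothetical; only the rule (proportion correct in the high group minus proportion correct in the low group, with the best item kept per content category) is taken from the description above.

```python
def discriminating_power(high_group, low_group):
    """Proportion correct in the high group minus that in the low group."""
    p_high = sum(high_group) / len(high_group)
    p_low = sum(low_group) / len(low_group)
    return p_high - p_low


def best_item_per_category(items):
    """Keep, for each content category, the item with the highest power.

    `items` maps an item name to (category, high responses, low responses),
    where responses are lists of 1 (correct) and 0 (incorrect).
    """
    best = {}
    for name, (category, high, low) in items.items():
        power = discriminating_power(high, low)
        if category not in best or power > best[category][1]:
            best[category] = (name, power)
    return {category: name for category, (name, _) in best.items()}


# Hypothetical responses from four high scorers and four low scorers.
items = {
    "item_1": ("fractions", [1, 1, 1, 1], [0, 0, 1, 0]),
    "item_2": ("fractions", [1, 1, 1, 0], [0, 1, 1, 0]),
    "item_3": ("verbal problems", [1, 1, 0, 1], [0, 0, 0, 0]),
    "item_4": ("verbal problems", [1, 1, 1, 1], [1, 1, 1, 0]),
}
selected = best_item_per_category(items)
print(selected)
```

Because the index is computed from the two extreme groups only, nothing in the procedure guarantees that the selected items discriminate equally well among the intermediate groups.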

In each of these two examples the test provided
information related to content; however, no information relating
to the mastery of specific skills was obtained. In the former
the score indicated a percentage of the words; in the latter
the score indicated the most frequent score obtained for a given
group, but did not indicate what score a given individual would
obtain nor to what items that score pertained. Also, in
reference to the latter of Ebel's examples, the items were chosen
to discriminate between scores of 700 and 300. The author
evidently assumed that these same items would have the best
discriminating powers for the other groups reported.

B. Solutions for Interpreting Specific Behaviors
from Test Scores

One solution to obtaining a criterion-referenced test
whose score would indicate specific behaviors mastered by a
student would be a test whose items were sequentially scaled.
The items would be so arranged that once an individual missed
an item he would miss all subsequent items in the test. For
example, an individual obtaining a score of 8 would have answered
each of the first eight items correctly and none of the items
beyond 8 correctly; a student failing item 3 should fail all
subsequent items. According to Ebel:

It is possible to imagine a test which would
give highly consistent results across items and
across students when administered to a particular
group. Results would be called consistent if
success by a particular student on a particular
item practically guaranteed success on all other
items in the test which were easier for the group
than that item. Correspondingly, failure on a
particular item would almost guarantee failure
on all harder items if the student responses were
highly consistent.... Such tests can be imagined
but are seldom met with in practice.^18

Two techniques relating to scaled tests are Loevinger's
"homogeneous tests"^19 and Guttman's "scalogram analysis."^20



^18 Robert L. Ebel, Measuring Educational Achievement
(Englewood Cliffs: Prentice-Hall, Inc., 1965), pp.

^19 Jane Loevinger, "A Systematic Approach to the
Construction and Evaluation of Tests of Ability," Psychological
Monographs, LXI (1947), No. 285; Jane Loevinger, "The Technic
of Homogeneous Tests Compared with Some Aspects of 'Scale
Analysis' and Factor Analysis," Psychological Bulletin, XLV (1948),
pp. 507-529.

^20 Louis Guttman, "A Basis for Scaling Qualitative Data,"
American Sociological Review, IX (1944), pp. 139-150;
Louis Guttman, "The Problem of Attitude and Opinion
Measurement," in Samuel A. Stouffer et al., Measurement and
Prediction (Vol. IV of Studies in Social Psychology in World War
II. 4 vols.; Princeton: Princeton University Press, 1950),
pp. 46-59.

Loevinger defined "perfectly homogeneous tests" of ability
as tests "such that, if A's score is greater than B's score,
then A has more of some ability than B, and it is the same
ability for all individuals A and B who may be selected."^21
She proposed a "coefficient of homogeneity," ranging in value
from zero to one, which is the ratio of the difference between
the variance of a given test and the variance of a "perfectly
heterogeneous test" with the same distribution of item
difficulties, to the difference between the variance of a
perfectly homogeneous test with the same distribution of item
difficulties and the variance of the perfectly heterogeneous test:

$$H_t = \frac{V_x - V_{het}}{V_{hom} - V_{het}}$$

A "perfectly heterogeneous" test is defined as a test "composed
of items each of which measures an ability independent of the
abilities measured by the other items."^22

What Loevinger desired was a test which was
consistent with respect to the ability being measured. People who
obtained the same scores would have answered the same items
correctly. While being at different levels of difficulty,
the items of the test had to measure the same content that
defined the ability. Behavior could be inferred from test
score by applying one of Loevinger's theorems, "When the items
of a perfectly homogeneous test are arranged in order of
increasing difficulty, every individual will pass all items up
to a certain point and fail all subsequent items."^23

^21 Loevinger, "A Systematic Approach," p. 28.
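The ratio that defines the coefficient of homogeneity can be sketched numerically. The expressions used below for the two reference variances are assumptions derived from the verbal definitions above rather than Loevinger's own computing formulas: for independent items the variance of the total score is the sum of the item variances p(1 - p), and for a perfectly homogeneous test the proportion passing both of two items equals the smaller of their two proportions passing. Variances are computed with N, not N - 1, in the denominator.

```python
def loevinger_homogeneity(responses):
    """Sketch of H_t = (V_x - V_het) / (V_hom - V_het) for 0/1 item scores.

    `responses` is a list of rows, one row of item scores per person.
    """
    n = len(responses)
    k = len(responses[0])
    p = [sum(row[i] for row in responses) / n for i in range(k)]

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    v_x = sum((t - mean_total) ** 2 for t in totals) / n

    # Perfectly heterogeneous test: independent items, covariances are zero.
    v_het = sum(pi * (1 - pi) for pi in p)
    # Perfectly homogeneous test: P(pass both i and j) = min(p_i, p_j).
    v_hom = sum(min(p[i], p[j]) - p[i] * p[j]
                for i in range(k) for j in range(k))

    if v_hom == v_het:
        raise ValueError("coefficient is undefined for these item difficulties")
    return (v_x - v_het) / (v_hom - v_het)


# Hypothetical perfectly homogeneous data: six persons with scores 0 to 5.
perfect = [[1] * s + [0] * (5 - s) for s in range(6)]
h_perfect = loevinger_homogeneity(perfect)
print(round(h_perfect, 6))
```

For the hypothetical data, in which every person passes all items up to a point and fails the rest, the observed variance coincides with the homogeneous reference variance and the coefficient reaches its upper limit.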

Of concern to Loevinger was whether the proposed
estimate of homogeneity was unbiased. Some evidence that the
estimate may be biased was provided by Carroll.^24 Employing
random numbers and hypothetical individuals he found Loevinger's
coefficient "to be biased positively because of chance
variations in item difficulties."^25 To be homogeneous in the
Loevinger model the items should measure the same ability but
at varying levels of difficulty. Carroll, however, was able
to obtain a value as high as .32 for Loevinger's coefficient
of homogeneity with items varying in difficulty only by chance.

The Loevinger technique is limited, however, to tests
composed of items of the same content. In most tests of ability
the content is likely to vary within the test.

^22 Ibid.

^23 Loevinger, "The Technic of Homogeneous Tests,"
p. 508.

^24 John B. Carroll, "Criteria for the Evaluation of
Achievement Tests--from the Point of View of Their Internal
Statistics," Proceedings, 1950 Invitational Conference on
Testing Problems (Princeton: Educational Testing Service, 1951),
pp. 95-97.

^25 Ibid., p. 97.

The second technique, Guttman's "scalogram analysis,"
does not require items of the same content. According to
Suchman:

It is also important to remember that scale
analysis should not be depended upon to determine
content. An item of differing content may fit
into the scale pattern of an area, while items
with homogeneous content need not scale.^26

Edwards and Kilpatrick have also noted this characteristic of
scalogram analysis:

The merits of scale analysis, as a technique
for evaluating a set of items, are obvious and
need no defense. But scale analysis can be applied
to any set of items, regardless of how the set is
selected.^27

Arising from problems in attitude scaling and opinion
polling, scalogram analysis attacks directly the problem of
determining behavior from score. As Guttman stated:

From a person's score we would then know
precisely to which problems he knows the answers
and to which he does not know the answer. Thus
a score of 2 does not mean simply that the person
got two questions right, but that he got two
particular questions right, namely, the first and
second. A person's behavior on the problems is
reproducible from his score.^28

Guttman further discussed the possibility of
determining behavior from score when he stated:

Scale analysis tests the hypothesis that a
group of people can be arranged in an internally
meaningful rank order with respect to an area
of qualitative data. A rank order of people is
meaningful if, from the person's rank order, one
knows precisely his responses to each of the
questions or acts included in the scale.^29

^26 Edward A. Suchman, "The Scalogram Board Technique
for Scale Analysis," in Samuel A. Stouffer et al.,
Measurement and Prediction (Vol. IV of Studies in Social Psychology
in World War II. 4 vols.; Princeton: Princeton University
Press, 1950), p. 119.

^27 Allen L. Edwards and Franklin P. Kilpatrick, "Scale
Analysis and the Measurement of Social Attitudes," Psychometrika,
XIII (1948), 109.

^28 Guttman, "A Basis for Scaling," p. 143.
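The sense in which behavior is "reproducible from score" on a perfect scale can be shown in a brief sketch. It assumes dichotomous items already arranged from easiest to hardest; the function names are introduced here for illustration only.

```python
def pattern_from_score(score, n_items):
    """Response pattern implied by a score on a perfect scale.

    Items are assumed to be ordered from easiest to hardest, so a score
    of 2 means the first and second items were passed and no others.
    """
    if not 0 <= score <= n_items:
        raise ValueError("score must lie between 0 and the number of items")
    return [1] * score + [0] * (n_items - score)


def is_reproducible(responses):
    """True when the observed responses equal the pattern implied by the score."""
    return list(responses) == pattern_from_score(sum(responses), len(responses))


print(pattern_from_score(2, 5))
print(is_reproducible([1, 1, 0, 0, 0]))
print(is_reproducible([1, 0, 1, 0, 0]))
```

The second record in the usage lines has the same score of 2 as the first, but its responses cannot be recovered from that score, so it would count as a departure from the scale.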



C. The Technique of "Scalogram Analysis"

Guttman defines a scale as each person's responses
being reproducible from his rank alone.^30 The technique for
determining the existence of a scale involves essentially
two steps: (1) ranking the items from "most extreme" to
"least extreme," i.e., from the item chosen or answered correctly
by the fewest people ("the most extreme") to the item chosen
or answered correctly by the most people ("the least extreme");
and (2) ranking the individuals from lowest to highest on the
basis of total score. If a scale exists the resultant pattern
when correct and incorrect responses are tabulated will be a
parallelogram (or a triangle if only correct responses are
recorded).^31 The following example will provide an illustration
of the resultant patterns. Consider a five item test
administered to five people and found to be a perfect scale.
Figure 1 shows the parallelogram pattern when both correct
and incorrect responses are recorded (X's).

^29 Louis Guttman, "The Basis for Scalogram Analysis,"
in Stouffer et al., Measurement and Prediction, p. 88.

^30 Ibid., p. 62.

^31 Ibid., pp. 60-90.

FIGURE 1

PARALLELOGRAM PATTERN OF A PERFECT
FIVE ITEM SCALE
In discussing the rank ordering of individuals and
items from such a pattern Suchman said:

Such a rank order has the property of
permitting one to derive from the rank order the
exact characteristics of the individuals in that
rank since there is only one possible combination
of items for any single rank order. Furthermore,
the rank order has the quality that any
individuals in a higher rank possess all the
characteristics of the individuals in a lower rank,
and at least one more in addition.^32

The above description pertains to a perfect scale,
that is, each individual's response to each item can be
perfectly reproduced. Such perfect scales are usually not found
in practice, just as perfectly reliable tests are not found in
practice. The degree to which the instrument approximates
a perfect scale is measured by a "coefficient of reproducibility."
The coefficient is defined as the "empirical relative
frequency with which values of the attributes do correspond to
intervals of a scale variable."^33 Thus, the coefficient
provides an indication of how well an individual's response
pattern can be reproduced knowing his total score. The value
of .90 was arbitrarily established as an acceptable lower limit
of the coefficient.

As described by Guttman in his original article^34 and
again by Suchman in 1950,^35 scalogram analysis was performed
by the use of scalogram boards. These are devices which,
through the use of balls and slats, permit the shifting of the
rank orders of individuals and items (or the combination of
categories for items with multiple categories) to obtain the
best scale.

The use of scalogram boards has not always been feasible,
however; the cost of the boards is prohibitive, they have a
fixed capacity, and they have been termed cumbersome.^36 In answer
to such criticism Guttman devised a paper and pencil
technique, the "Cornell Technique," to supplant the scalogram
boards.^37 The technique is applied to the data in the same
way as the boards, shifting the rank orders to obtain the best
arrangement of items. In either case, whether applying
scalogram boards or the Cornell Technique, the reproducibility
coefficient is computed in the same manner: (1) errors are
tabulated by counting the number of responses occurring
outside the cut-off points for each score, (2) the errors are
divided by the total number of possible responses (number of
people x number of items), and (3) the obtained quotient is
subtracted from 1.

$$R = 1 - \frac{\text{No. Errors}}{\text{No. Items} \times N}$$

et al., Measurement and Prediction, pp. 91-121.

Matrices," Psychometrika, XVIII (1953), pp. 111-113; Leon
Festinger, "The Treatment of Qualitative Data by 'Scale Analysis,'"
Psychological Bulletin, XLIV (1947), pp. 149-161.
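The three computing steps can be sketched for dichotomous items. The sketch assumes that the cut-off point for each person is fixed by his total score, so that an error is any response differing from the pattern that score implies once items are ordered from easiest to hardest; other error-counting conventions exist and may give different counts. The data are hypothetical.

```python
def reproducibility(matrix):
    """Return (errors, R) with R = 1 - errors / (number of items x N)."""
    n_people = len(matrix)
    n_items = len(matrix[0])
    item_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    # Order items from easiest (most passes) to hardest (fewest passes).
    order = sorted(range(n_items), key=lambda j: -item_totals[j])

    errors = 0
    for row in matrix:
        score = sum(row)
        observed = [row[j] for j in order]
        predicted = [1] * score + [0] * (n_items - score)
        # Step 1: count responses falling outside the cut-off for this score.
        errors += sum(1 for o, p in zip(observed, predicted) if o != p)

    # Steps 2 and 3: divide by the possible responses and subtract from 1.
    return errors, 1 - errors / (n_items * n_people)


# Hypothetical responses of four persons to five items.
data = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
errors, r = reproducibility(data)
print(errors, round(r, 2))
```

In the hypothetical data only the third person departs from the pattern implied by his score, and that single irregular record contributes both of the errors counted.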

D. Criticisms of Scalogram Analysis 

The use of the reproducibility coefficient has been
criticized in the literature. Several authors have found the
coefficient to be arbitrary and to be affected by the
difficulty levels of the test items.^38 The major objection is that
the reproducibility coefficient can be spuriously high
because of items with extreme marginal frequencies. Guttman
was aware of the effect of extreme marginals on the
reproducibility coefficient, however, and never stated that a high
reproducibility coefficient was the criterion for scalability:
"Reproducibility by itself is not a sufficient test of
scalability. It is the principal test, but there are at least
four other features that should be taken into account."^39

The four additional criteria which are employed to
insure against spurious reproducibility are: (1) The number
of items in the test should exceed 10. (2) The more categories
that could remain uncombined, the more credible the inference
of scalability; this criterion does not apply to dichotomous
items. (3) The marginal distributions should contain as wide
a range as possible, but with few extreme marginals. In the
case of dichotomous items an extreme marginal would be a
category chosen by 80% or more of the individuals. The
reproducibility of an item can never be less than the frequency
of the most frequently chosen category. (4) The errors
should not fall into a pattern, i.e., there should not be a
series of persons all having identical errors.^40

Journal of Opinion and Attitude Research, III (1949), pp. 47-
86; H. J. Eysenck, "Measurement and Prediction: A Discussion
of Volume IV of Studies in Social Psychology in World War
II," International Journal of Opinion and Attitude

Bulletin, LIV (1957), pp. 81-99.
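The effect of extreme marginals noted under criterion (3) can be illustrated by computing the floor that the marginals alone place under the coefficient. The sketch takes at face value the statement that no item's reproducibility can fall below the frequency of its most frequently chosen category, and averages those modal frequencies over items; the function name and the data are assumptions made for illustration.

```python
def marginal_lower_bound(matrix):
    """Average, over items, of the proportion choosing the modal category."""
    n_people = len(matrix)
    n_items = len(matrix[0])
    modal_total = 0
    for j in range(n_items):
        passes = sum(row[j] for row in matrix)
        # For a dichotomous item the modal category is the larger of
        # the pass count and the fail count.
        modal_total += max(passes, n_people - passes)
    return modal_total / (n_items * n_people)


# Hypothetical data: ten persons, one very easy item and one hard item.
extreme = [[1, 1], [1, 1]] + [[1, 0]] * 7 + [[0, 0]]
bound = marginal_lower_bound(extreme)
print(round(bound, 2))
```

With marginals of 90 percent and 80 percent the floor is already .85, so a coefficient near the conventional .90 would say little about whether the items actually scale.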

A single criterion to evaluate the spuriousness of
the reproducibility coefficient has been suggested by Menzel.^41
He developed a "coefficient of scalability" having the
following advantages: (1) it provides a safeguard against
spuriousness without relying on extraneous rules, (2) it does
not introduce the judgment of the investigator in applying a
rule, and (3) it permits analysis of scalograms that had to
be rejected because of extreme marginals. The coefficient may,
therefore, show that high scalability exists in spite of
extreme marginals.

The coefficient is computed by (1) obtaining the errors
as in the computation of the reproducibility coefficient, (2)
for dichotomous items, summing the non-modal marginal
frequencies for items and for individuals, and taking the smaller
of the two sums, and (3) dividing the errors by the sum
obtained in step 2, and subtracting the resultant quotient from
1. A minimum value of .65 was established as the criterion for
scalability.^42 Among those who have recommended and employed
the coefficient of scalability in conjunction with the
reproducibility coefficient are Auld, Eron, and Laffal; Lesser;
and Pearson.^43
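The three steps for the coefficient of scalability can be sketched for dichotomous items. The error count is supplied by the caller, since it is obtained as for the reproducibility coefficient; the function name, the data, and the error count of 2 are hypothetical.

```python
def coefficient_of_scalability(matrix, errors):
    """1 - errors / (smaller of the two sums of non-modal frequencies)."""
    n_people = len(matrix)
    n_items = len(matrix[0])

    # Non-modal frequency of an item: the smaller of its pass and fail counts.
    item_sum = 0
    for j in range(n_items):
        passes = sum(row[j] for row in matrix)
        item_sum += min(passes, n_people - passes)

    # Non-modal frequency of a person: the smaller of his right and wrong counts.
    person_sum = sum(min(sum(row), n_items - sum(row)) for row in matrix)

    max_errors = min(item_sum, person_sum)
    if max_errors == 0:
        raise ValueError("coefficient is undefined when no errors are possible")
    return 1 - errors / max_errors


# Hypothetical responses of four persons to five items, with 2 errors counted.
data = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
cs = coefficient_of_scalability(data, errors=2)
print(cs)
```

For these data two errors in twenty responses give a reproducibility of .90, yet the same two errors are half of the four that the marginals permit, so the coefficient of scalability falls below the .65 criterion.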

Not meeting the criteria for scalability in
scalogram analysis should not be interpreted as showing the
nonexistence of a scale, according to Eysenck and Crown. "Ultimately
we shall achieve the position of the physicist whose scales
show approximately 100% reproducibility, [but] there is little
to be gained by decrying the very real usefulness of many of
our own present-day scales."^44 In essence these authors are
arguing for use of less reproducible scales, but not arguing
against highly reproducible scales.

Continuing the criticisms of the reproducibility
coefficient, the reason for using the reproducibility
coefficient was questioned by Davis when he stated, "We compute
a reproducibility coefficient not because we have any real
desire to reproduce response patterns from scale scores but,
rather, because we hope that it is an index of certain
measurement properties."^45 Contrary to Davis' assumptions, however,
reproducing responses from scores is the goal established in
the present study.

^43 Frank Auld, Jr., L. A. Eron, and J. Laffal,
"Application of Guttman's Scaling Method to the T. A. T.,"
Educational and Psychological Measurement, XV (1955), pp. 422-435;
Gerald S. Lesser, "Application of Guttman's Scaling Method to
Aggressive Fantasy in Children," Educational and Psychological
Measurement, XVIII (1958), pp. 543-551; Richard S. Pearson,
"Plus Percentage Ratio and the Coefficient of Scalability,"
Public Opinion Quarterly, XXI (1957), pp. 379-380.

Smith disclaims the notion of reproducibility in
testing for the existence of any scale.^46 He reached this
conclusion because he obtained two group factors on
factor-analyzing the data reported by Guttman in the 1947 article
concerning the Cornell Technique. According to Guttman only
one factor should have been found. Guttman's reply was that
Smith's "numerical work cannot be anything but pure nonsense."^47
Guttman showed that Smith reported perfect correlations between
items 2 and 4, and between items 3 and 5, yet 2 correlated
differently with the other items than 4, and 3 correlated
differently with the other items than 5. Smith's matrix was
non-Gramian and as a result could not be factored by the
Thurstone technique which he employed.

In addition to the coefficient of reproducibility, other
aspects of scalogram analysis have been subjected to criticism.
The determination of cut-off points for scores has been found
by some authors to be arbitrary and difficult.^48 Their
criticisms have pertained to attitude scales having items
with three or more response categories. In an achievement test
having dichotomous items, where the desire is to infer behavior
from score, the various scores would determine the cut-off
points. Theoretically an individual should answer items
correctly up to a certain point and then stop. Therefore, a score
of 4 would cut off the first four items, etc. As a result, the
above criticisms would not apply.

Contrary to the above example, opinions have been
expressed that individuals do not usually answer items up to a
certain point and then stop. Rather, a gradual transition
from correct to incorrect has been suggested. Brown, Bartelme,
and Cox proposed that "the score of the individual is then at
that point on the scale at which the average deviation of the
right items above it equals the average deviation of the wrong
items below it."^49 The authors based their conclusions on
results obtained from the Gesell Development Schedules. Glaser
hypothesized that on certain tests measuring one dimension, when
the test items are ordered in terms of their scale position,
there is a region of transition from positive to negative
responses.^50 He further hypothesized that the distribution of
inconsistent responses in this region is approximately normal.
Glaser analyzed the following tests: the Faust-Schorling
Test of Functional Thinking in Mathematics, the Differential
Aptitude Space Relations Test, and a vocabulary test composed
of items from the Stanford-Binet, Wechsler-Bellevue, Wide Range,
and Columbia Vocabulary tests. Each test was composed of 80
items. The results showed the distribution for the vocabulary
test to be approximately normal. The distributions for the
mathematics and space relations tests were truncated. Glaser
attributed the truncated distribution to the restricted range
of test items.^51 If more items at higher levels of difficulty
could have been added the distribution of responses might well
have approximated normality.

The results should be interpreted in the light of the 
type of test employed in each of the above studies. All tests 
were published tests which were not constructed to yield scores 
from which behavior could be inferred. In addition, the range 
of gradual transition from pass to fail could well be attributed 
to very gradual transitions in item difficulties accompanied 
by a large number of items at each difficulty level. It could 
be hypothesized that as the number of items increases, and the 
steps between item difficulties become more gradual, the dis- 
tribution of inconsistent responses would approach normality. 
The studies cited above lend some support for this. The opposite 
could also be hypothesized, i.e., with a decreasing number of 
items and with more discrete steps between difficulties, the 
distribution of inconsistencies would depart from normality. 
The truncated distributions obtained by Glaser offer some evi- 
dence in support of this. Carrying the latter hypothesis to 
its extreme, it could be hypothesized that at some point indi- 
viduals would no longer have inconsistent responses but would 
answer items to a certain point and then stop. It is this type 
of test that is suggested in the present study, and scalogram 
analysis is suggested as a technique to yield such a test. 

Scalogram analysis has also been criticized by some as 
inadequate for the selection of items.^52 The reply to these 
findings is simply that scalogram analysis is not an item 
selection technique. As Guttman states, "We have continually 
stressed that items are to be selected before any statistical 
analysis is performed, and are not to be rejected because of 
any statistical analysis. ... Scale analysis is not a technique 
for item selection and rejection, but rather for studying the 
structure of a universe. ..."^53 

For Guttman, the universe "is the concept whose 
scalability is being investigated, such as marital adjustment, 
opinion of British fighting ability, knowledge of arithmetic, 
etc. The universe consists of all the attributes that define 
the concept." One aspect of the theory of scalogram analysis 
is that from a sample of items comprising the scale "inferences 
can be drawn concerning the complete distribution of the uni- 
verse for the population.... The hypothesis that the complete 
distribution is scalable can be adequately tested with a sample 
distribution." Criticisms of this aspect seem, to this in- 
vestigator, to be warranted. It would appear that while a sample 
of items could well be scaled, as for example, in the following 
3 item test: 

     a. 2 + 2 

     b. (item illegible in the source) 

     c. ∫ 2x dx 

the conclusion that the universe of mathematics is scalable is 
not tenable. Many skills in mathematics depend on the order 
taught; many skills are parallel, being of the same difficulty. 
Schuessler argued that the sample results are both a function 
of the way the universe is defined by the investigator and the 
manner in which the items are chosen from a field of content 
defining the topic. This implies that conclusions concerning 
the scalability of a universe may be restricted to a given 
investigator’s version. 

Further criticism of this aspect of scalogram analysis 
was provided by Torgerson^57 and Campbell and Kerckhoff.^58 
Each warned against generalizing to the universe from a sample. 
Campbell and Kerckhoff stated that the proposition, "If a 
universe is scalable any sample selected from the universe will 
be scalable," is not identical to the proposition, "If a sample 
is scalable the universe from which the sample is selected 
is scalable." The authors suggested that if the latter 
proposition is warranted any other items from the same uni- 
verse should also scale with the original set. They reported, 
however, that judges have not been consistent in making these 
additional selections, but, unfortunately, no empirical 
evidence was provided. 

The above criticisms are concerned, however, with 
relating a sample to the universe, not with scaling a sample 
of items by applying the scalogram technique. A test is 
considered as being composed of a sample of items representing 
the population of possible items pertaining to a given area. 
The proposed study is to determine if a test will scale, if the 
test will yield a score from which behavior can be inferred. 
If the Guttman scalogram technique can be applied to achieve- 
ment testing in order to obtain such scores, the ensuing con- 
clusions will be concerned with the sample of items, the test, 
not with the universe represented by the sample of items. 

E. Applications of Scalogram Analysis 

The application of scalogram analysis to achievement 
testing has been suggested by Guttman on several occasions. 
"Scale analysis is applicable much more widely than to attitudes. 
For example, it is useful for mental tests and examinations." 
He also described its use "with large classes of behavior 
like achievement tests." "For achievement tests, where all 
items are dichotomous, being marked either right or wrong, 
the Cornell technique is perhaps the best of all to be used." 
While Guttman has suggested that scalogram analysis be 
applied to achievement testing it has been employed almost 
entirely in other areas. The most widespread application has 
been in the areas of opinion and attitude measurement. In 
Volume IV, Studies in Social Psychology in World War II, 
Guttman refers to at least seven different studies related to 
various attitudes of soldiers during the Second World War. 
Among the many attitude studies reported in the literature, a 
brief list includes: (1) Niven's comparison of the Cornell 
Technique and the Reciprocal Averages technique to the attitudes 
of manufacturing supervisors, (2) an investigation of atti- 
tude toward economic liberalism-conservatism, (3) opinion 
toward science, (4) attitudes toward negroes, (5) the 
development of an attitude scale on anti-Semitism, (6) the 
scaling of interview responses bearing on the favorability 
of attitude toward marriage, and (7) Dodd's application to 
opinion polls in general. 

Louis Guttman, "On Festinger's Evaluation of Scale 
Analysis," Psychological Bulletin, XLIV (1947), 458. 

The Guttman technique has also been applied in other 
areas. Its successful application to projective 
techniques has been shown by Auld, Eron, and Laffal and 
Lesser. Auld, et al., applied scalogram analysis to themes 
from four selected pictures of the Thematic Apperception Test 
given to 100 sailors attending submarine school. While the 
authors did not succeed in constructing a scale of aggression, 
they did succeed in constructing a scale of sexual motivation. 
Lesser applied the scaling procedure to the fantasy aggression 
responses of a sample of pre-adolescent boys. Again the cri- 
teria for scalability were met. 

Eysenck and Crown, "An Experimental Study," pp. 47-86. 

In addition to projective techniques, scalogram analysis 
has had application in other diversified areas. Riley et al. 
used the techniques to scale groups and objects of group 
action.^71 For example, in a group certain pairs talk of movies, 
others of movies and petting, but no groups talk of petting 
alone. Movies and petting were scaled on the degree of in- 
timacy which they represented as subjects for conversation. The 
application of scale analysis to the scaling of objects was also 
reported by Abell.^72 Through the use of questionnaire items 
dealing with homemaking practice, Abell found that foods served, 
use of preservatives, and vegetables grown were scaled. 

Kofsky found tasks involving the classification of 
objects to be scaled for children of ages four to nine. The 
classification schemes were based on the developmental se- 
quence of cognitive skills hypothesized by Piaget. The se- 
quence essentially involved, first, sorting two objects according 
to a common feature; then sorting three or more objects 
by a common feature; next, introducing the concepts "some" or 
"all"; then classifying objects into more than one 
group; and, finally, sorting subsets and combinations of 
subsets. 

Rater observations were employed to scale a check 
list for technical skills in Naval electronics.^74 The skills 
were ordered as to amount of inservice training required. The 
results indicated that the check list of technical skills was 
scalable. Similarly, a list of behaviors was found to be 
scalable by Scott in applying scalogram analysis in the investi- 
gation of delinquent behavior.^75 The list was obtained from 
a questionnaire covering offenses such as stealing. The res- 
pondents were asked to indicate the frequency with which they 
had committed each type of offense. 

While Guttman had suggested its use for achievement 
testing, the evidence of application of the technique in this 
area has been fragmentary. Postove employed scalogram analysis 
in the development of a speechreading test.^76 She presented 
to adults a silent film which contained 99 conversational 
sentences, the subjects being required to lip read. Scalogram 
analysis was used to obtain 15 sentences which were reported 
to be scaled. The results are questionable, however, for no 
evidence such as a reproducibility coefficient is supplied. 
In addition Postove used scalogram analysis to select items 
from an item pool, a procedure contrary to Guttman's recom- 
mendations. The resultant 16 item scale was never administered, 
as such, to the group. Coughenour and Christiansen developed a 
test of farmers' knowledge of old-age and survivors' insurance.^77 
The multiple choice items pertained to distinctive features 
of the insurance and to matters of importance for farmers' 
participation in the group. The test was administered as an 
interview, and the obtained reproducibility coefficient furnished 
evidence for scalability. 

Neither of the above studies, however, dealt with the 
application of Guttman's technique to tests employed in assessing 
achievement of school children. The only study, to this 
investigator's knowledge, which related to the use of scalo- 
gram analysis with classroom achievement tests was by Bligh.^78 
He applied the technique to the Paragraph Meaning, Study Skills, 
and Arithmetic Computation subtests of the Stanford Achievement 
Battery, Advanced Form J. The initial results did not warrant 
the acceptance of the tests as scaled; the reproducibility 
coefficients did not reach .80. The tests were refined by 
selecting items which maximized the ratios of the sums of all 
the covariances to the variances of the tests. The revised 
tests were administered to two new samples, but the obtained 
reproducibility coefficients still did not reach the minimum 
acceptable value of .90 (the range was .818 - .872). Because 
of the magnitude of these coefficients, however, Bligh suggested 
the value of further investigation of scalability in achieve- 
ment testing. 

To this investigator's knowledge no other studies 
concerning the applicability of scalogram analysis to achieve- 
ment testing have been reported in the literature. With in- 
creasing demand for criterion-referenced measures comparing an 
individual's performance to an absolute standard independent of 
reference to the performance of others, the feasibility of applying 
this technique, in order to obtain scaled scores, should be 
determined. The results of a pilot study are encouraging. 
The study involved the construction of a test in addition of 
whole numbers, covering concepts typically taught during early 
elementary education. The authors identified the objectives 
to be tested by, first, determining the terminal objective, 
then, working backwards by using as a guide the question: 
What skills were mastered previously in order to master this 
objective? A list of fifteen objectives and sample items was 
developed (See Appendix A). 

to Determine the Scalability of an Elementary Math Achievement 
Test" (Paper read at the Pennsylvania Educational Research 
Association conference, Pittsburgh, Pennsylvania, April, 1965). 
(Mimeographed); Richard C. Cox and Glenn T. Graham, "The 
Development of a Sequentially Scaled Achievement Test" (Paper 
read at the 50th Annual Meeting of the American Educational 

From two to five items were constructed for each 
objective. This resulted in a problem, however, for the test 
would undoubtedly not scale with more than one item pertaining 
to each objective. Rather, the test would be of the previously 
mentioned form discussed by Brown, et al., and Glaser, a test 
having a region of inconsistent responses. As a solution, the 
items corresponding to a particular objective were com- 
bined to form one "contrived item." 

As an example, consider the three items: 

      20       36       54 
     +11      +42      +33 

These items would comprise one "contrived item" testing 
objective 8 on the list in Appendix A. Such a procedure of forming 
"contrived items" has been employed by Stouffer, Borgatta, Hays 
and Henry.^80 However, these authors formed the "contrived items" 
after the initial scale analysis as an aid to establishing cut- 
off points. In Cox and Graham's study, the "contrived items" 
were formed before the analysis, the cut-off points being 
determined exclusively by total score. 

Samuel A. Stouffer et al., "A Technique for 
Improving Cumulative Scales," Public Opinion Quarterly, XVI 
(1952), pp. 273-291. 

In order to obtain a substantial range of ability levels 
Cox and Graham administered the test to a kindergarten, first, 
and second grade. The students were then ranked according to 
total score, possible scores ranging from 0 to 15, with a 
contrived item considered as "passed" if two-thirds of the 
items composing it were answered correctly. Inspection of the 
resultant response pattern indicated that some of the 
contrived items were not in proper order. With elimination 
of three contrived items, one because of its dependence on a 
specific curriculum and two because of ambiguous directions, 
and with the rearrangement of the remaining twelve contrived 
items a reproducibility of .977 was obtained. In order to 
insure against spuriously high reproducibility, Menzel's 
coefficient of scalability was also calculated, and equalled 
.902. 
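
The scoring rule used in the pilot study can be sketched as follows. This is a minimal illustration, not the authors' procedure: it reads "two-thirds" as "at least two-thirds of the items correct," and it assumes that a contrived item made up of only one or two items is passed only when every item is correct. The function names are hypothetical.

```python
def contrived_item_passed(item_results):
    """Return True if a contrived item counts as passed.

    item_results: list of 0/1 results for the items testing one objective.
    """
    n = len(item_results)
    if n == 0:
        raise ValueError("a contrived item needs at least one item")
    correct = sum(item_results)
    if n <= 2:
        # Assumption: with only one or two items, all must be correct.
        return correct == n
    # Integer comparison avoids floating-point error in correct / n >= 2 / 3.
    return 3 * correct >= 2 * n


def total_score(pupil_results):
    """Total score = number of contrived items passed by one pupil."""
    return sum(contrived_item_passed(items) for items in pupil_results)


if __name__ == "__main__":
    # One pupil, three objectives with 3, 2, and 5 items respectively.
    pupil = [[1, 1, 0], [1, 0], [1, 1, 1, 1, 0]]
    print(total_score(pupil))
```

Ranking pupils by `total_score` then gives the ordering from which the response pattern is inspected.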

In order to validate these preliminary results, the 
revised test was administered to different kindergarten, first, 
and second grade children. The analysis of their score patterns 
yielded a reproducibility coefficient of .970 and a coefficient 
of scalability of .792. The authors concluded that it was 
indeed possible to apply Guttman's scalogram analysis to obtain 
a scaled achievement test. The results, while tempered by 
the test's being based on a restricted area of subject matter, 
are encouraging for further investigation. 

F. A Methodology for the Construction of a Sequentially Scaled 
Achievement Test 

While the above study focused on the applicability of 
scalogram analysis to achievement testing, a methodology in- 
corporating scalogram analysis for constructing scaled tests was 
concomitantly being implied. The methodology gleaned from the 
pilot study essentially consists of: 

1. Selection of behavioral objectives in the curriculum 
which appear, logically, to be sequenced. 

     a. Identification of terminal objective. 

     b. Employment of question, "What skills must 
     have been learned previously?", as a guide 
     for selection of subsequent objectives. 

2. Construction of items corresponding to each 
objective. 

3. Combination of the items for each objective into one 
"contrived item." 

4. Establishment of a criterion for passing each 
contrived item. 

5. Administration and scoring of the test. 

6. Application of Guttman's "scalogram analysis" 
technique including computation of the reproducibility coefficient. 

7. Computation of Menzel's "coefficient of scalability" 
to insure against a spuriously high reproducibility coefficient. 
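
Steps 6 and 7 can be sketched in code under stated assumptions. An "error" is counted here as any response that disagrees with the ideal pattern implied by the pupil's total score; this is one common counting convention, and the Cornell technique may count errors differently. The coefficient of scalability is taken as one minus the ratio of observed errors to the maximum possible errors, with the maximum computed from both the item marginals and the pupil marginals and the smaller value used; this follows the usual description of Menzel's coefficient but should be checked against the original article. Columns are assumed to be already ordered from easiest to hardest, and all function names are hypothetical.

```python
def ideal_pattern(score, n_items):
    """Pattern implied by a total score on a perfect scale."""
    return [1] * score + [0] * (n_items - score)


def count_errors(responses):
    """Count responses that disagree with each pupil's ideal pattern."""
    errors = 0
    for row in responses:
        ideal = ideal_pattern(sum(row), len(row))
        errors += sum(obs != exp for obs, exp in zip(row, ideal))
    return errors


def reproducibility(responses):
    """Reproducibility = 1 - errors / (pupils * items)."""
    total = len(responses) * len(responses[0])
    return 1 - count_errors(responses) / total


def scalability(responses):
    """Scalability = 1 - errors / maximum errors (assumed definition)."""
    n_people = len(responses)
    n_items = len(responses[0])
    passes = [sum(row[j] for row in responses) for j in range(n_items)]
    # Maximum errors from item marginals: the non-modal count of each item.
    by_item = sum(min(p, n_people - p) for p in passes)
    # Maximum errors from pupil marginals: the non-modal count of each row.
    by_person = sum(min(sum(row), n_items - sum(row)) for row in responses)
    max_errors = min(by_item, by_person)
    if max_errors == 0:
        raise ValueError("scalability is undefined when no errors are possible")
    return 1 - count_errors(responses) / max_errors


if __name__ == "__main__":
    # Six pupils, four contrived items ordered easiest to hardest.
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],  # the only pattern that departs from a perfect scale
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(round(reproducibility(data), 3), round(scalability(data), 3))
```

The example shows why the second coefficient is wanted: extreme item or pupil marginals leave little room for error, so reproducibility can be high even when the scale is weak.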

While the methodology was successfully applied to a restricted 
area, further investigation of its applicability to 
a wider range of content and corresponding behavioral objectives 
should be attempted. 

G. Evaluation of Sequentially Scaled Achievement Tests 

A methodology for construction is, however, only one 
aspect of test development. Another important aspect of the 
development of such a test is the assessment of the test in 
terms of the typical evaluation procedures. Evaluation pro- 
cedures commonly applied to standardized tests employed in the 
schools concern the areas of reliability, validity, and item 
analysis. Investigation of these evaluation procedures as they 
apply to scaled tests has not, to this investigator's knowledge, 
been attempted. There is some evidence, however, that there 
are differences in evaluation procedures for norm-referenced 
and criterion-referenced tests, of which scaled tests are a 
variety. Such evidence has been reported by Cox and Vargas 
concerning item analysis procedures. 

Cox and Vargas investigated the effect of employing 
differential item selection techniques to identify items which 
discriminated according to the requirements of norm and cri- 
terion-referenced tests. For their particular criterion- 
referenced situation the best item would be one which was 
failed before training and passed afterwards. The usual norm- 
referenced item analysis procedures yield items which discri- 
minate between high and low scorers after training. The authors 
cited an extreme example: a perfectly discriminating item for 
the criterion-referenced test would be one failed by all on a 
pretest and passed by all on a posttest. Such an item would be 
rejected by the norm-referenced technique at either the pretest 
or posttest level because it makes no discriminations among high 
and low scorers, being answered alike by all persons. 

Richard C. Cox and Julie S. Vargas, "A Comparison 
of Item Selection Techniques for Norm-Referenced and Criterion- 
Referenced Tests" (Paper read at the Annual Meeting of the National 
Council on Measurement in Education, Chicago, Illinois, February 

The authors suggested a difference index based on dis- 
criminations made between pre and posttest groups. They com- 
pared this index to the standard upper 27% - lower 27% index 
computed for items on each of two arithmetic tests given as 
pre and posttests in an individualized instruction program. 
Cox and Vargas indicated that if a final test consisted of the 
best two-thirds of the items selected by either procedure, 
approximately seventy-five percent to eighty percent of the 
items would be the same in each case. The authors noted, 
however, that some items not discriminating between pre and 
posttest groups would be retained by the upper-lower 27% 
procedure while some of the best discriminating items between 
pre and posttest groups would be discarded. 
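
The contrast between the two indices can be illustrated with a small sketch. The exact formula of the Cox and Vargas index is not given in the text above; the sketch assumes it is the proportion passing an item on the posttest minus the proportion passing on the pretest. The upper-lower index is computed as the proportion passing in the top 27% of total scores minus the proportion passing in the bottom 27%. Function names and data are hypothetical.

```python
def difference_index(pre, post):
    """Assumed form: proportion passing posttest minus proportion passing pretest."""
    return sum(post) / len(post) - sum(pre) / len(pre)


def upper_lower_index(item, totals, fraction=0.27):
    """Proportion passing among high scorers minus that among low scorers."""
    n = len(totals)
    k = max(1, round(n * fraction))
    # Rank pupils by total score; take the k lowest and k highest.
    ranked = sorted(range(n), key=lambda i: totals[i])
    lower, upper = ranked[:k], ranked[-k:]
    return sum(item[i] for i in upper) / k - sum(item[i] for i in lower) / k


if __name__ == "__main__":
    # The extreme case from the text: failed by all before, passed by all after.
    pre = [0] * 10
    post = [1] * 10
    posttest_totals = list(range(10))  # hypothetical total scores
    print(difference_index(pre, post))
    print(upper_lower_index(post, posttest_totals))
```

For this extreme item the difference index is at its maximum, while the upper-lower index is zero, so the norm-referenced procedure would reject the item.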

While the above study was not specifically concerned 
with scaled tests, it did concern the area of criterion- 
referenced measurement which includes scaled tests. The re- 
sults of the study suggest that how a test is to be employed 
or constructed will be a determining factor for the type of 
item analysis procedure required. These results support the 
conclusion of Husek who stated, "Unfortunately there is no 
evidence to demonstrate that [test] items which would be most 
useful for one purpose are very useful for another purpose."^82 
Therefore, a test that is to be scaled may well require dif- 
ferent item analysis procedures from a norm-referenced test. 
Reliability and validity may be suspect for the same reasons. 
Since scaled tests of achievement have not, to this investi- 
gator's knowledge, been discussed in the literature, no 
information regarding the characteristics of the reliability 
and validity of scaled tests is available. Such information 
should be obtained, for if the development of a scaled test is 
to be thorough, both construction methodology and evaluation 
procedures should be discussed. 

T. R. Husek, "Different Kinds of Evaluation and 
Their Implications for Test Development" (Paper read at the 50th 
Annual Meeting of the American Educational Research Association, 
Chicago, Illinois, February 19, 1966), p. 3. (Mimeographed.) 

H. Summary 

With the development of individualized instruction and 
similar educational innovations, criterion-referenced measures 
are in increased demand. With students being compared to ab- 
solute standards as criteria, what specific behaviors a student 
has mastered as well as how much he has mastered are desired 
kinds of information to be obtained from the test. Similar to 
norm-referenced test raw scores, criterion-referenced test raw 
scores have, to date, supplied most information regarding the 
latter (how much) and very little information regarding the 
former. 

One solution to the problem of interpreting from a 
test raw score what specific behaviors a student has mastered 
would be a test whose items were sequentially scaled. A test 
so constructed would have the characteristic that an individual 
would pass items to a certain point. Once failing an item, he 
would fail all subsequent items. Therefore, a score of 4 would 
mean items 1, 2, 3 and 4 were passed and all other items failed. 

A technique which yields tests of this type is 
Guttman's "scalogram analysis." While it was developed as a tool for 
attitude and opinion investigation, Guttman has suggested the 
use of the technique in the construction of achievement tests. 
To date, however, scalogram analysis has been applied to almost 
everything but achievement testing. In many studies the tech- 
nique has yielded a sequentially scaled measuring instrument. 
If the same results could be obtained for achievement tests 
their scores would provide information indicating what specific 
behaviors the student has or has not mastered. An investigation 
of the applicability of Guttman's scalogram analysis to achieve- 
ment testing is needed. 

Encouraging results were obtained from a pilot study 
concerning the development of a sequentially scaled achievement 
test in the addition of whole numbers. Also from the pilot 
study, a methodology incorporating scalogram analysis was suggested 
for the construction of scaled achievement tests. Further in- 
vestigation applying the methodology to a wider range of skills 
and objectives seems warranted. 

In addition to methodology for construction, another 
important aspect of the test development process concerns 
evaluation. The evaluation procedures typically applied in the 
development of standardized achievement tests (norm-referenced 
measures) are in the areas of reliability, validity, and item 
analysis. With the exception of some evidence that criterion- 
referenced measures may require different item analysis pro- 
cedures from norm-referenced measures, no evidence is available 
concerning the comparability of the evaluation procedures for 
scaled tests as opposed to standardized tests. To be thorough 
the development of the scaled tests should include both test 
construction methodology and evaluation procedures. 



III. STATEMENT OF THE PROBLEM 



The purpose of this study is to apply a methodology, 
incorporating Guttman's "scalogram analysis," for the construc- 
tion of sequentially scaled achievement tests, and to develop 
evaluation procedures concerning the reliability, validity, and 
item analysis of the obtained tests. 




IV. PROCEDURE 



Application of the methodology for the construction 
of sequentially scaled tests was attempted in five areas 
of arithmetic achievement: addition, subtraction, numeration, 
time telling, and concepts in money. The behavioral objectives 
selected for this study pertained to skills taught in grades 
one through three. (See Appendix B for a listing of objectives 
for each test.) The sample of students was obtained from 
two schools in the Baldwin-Whitehall district of suburban 
Pittsburgh. One school, Sickman Elementary School, provided 
a sample of "conventional" classroom instruction; the other, 
Oakleaf Elementary School, provided a sample of individualized 
instruction. All five tests were administered to both 
schools. 

The directions for each of the items were read to 
the students, and ample time was provided for the student 
to attempt all items. The scoring criterion employed was 
that two-thirds of the items had to be answered correctly in 
order to pass a contrived item. Where only one or two items 
composed a contrived item, all had to be answered correctly. 
The scalogram analysis procedures were applied (1) to the 
separate test results for each school to provide evidence 
for test scalability for the individual schools, and (2) to 
the combined test results from both schools to provide an over- 
all indication of the tests' scalability. To take into 
account spuriousness in the reproducibility coefficients, 
Menzel coefficients of scalability were also calculated. 

Following the application of the methodology for the 
construction of the sequentially scaled tests, evaluation 
was attempted in the areas of reliability, validity, and item 
analysis. 

In considering the reliability of the scaled tests 
certain methods commonly employed in evaluating achievement 
tests were considered. One of these methods concerned the 
equivalence of alternate forms of a test, but this procedure 
was deemed inappropriate for the present investigation. The 
purpose of the present investigation was to attempt to 
develop a scaled test in each of five selected areas of 
arithmetic. If the methodology were successful in pro- 
ducing scales, then whenever alternate forms of the scales 
were desired, the equivalence of alternate forms procedure 
could be applied. Such a procedure provided no information 
concerning the scalability of the tests, for it is possible 
to obtain a high alternate forms coefficient whether a test is 
scaled or not. 



Other reliability procedures concern the stability 
of a test. Typically a measure of stability, test-retest 
reliability, is obtained by correlating the scores from two 
administrations of a test, the second administration follow- 
ing an interval of time. The circumstances surrounding the 
time when the scaled tests were administered prevented any 
attempt to apply the test-retest procedure. Since the 
scaled tests had to be administered at the end of the school 
year, no time remained for retesting. An approximation 
to test-retest reliability was obtained, however, by employing 
the split-half reliability procedure. While providing an 
indication of the stability of the pupils' scores, such a 
procedure did contain some possible contaminating factors. 
According to Guilford and Ebel such a procedure functions 
best when the items of a test are of equal difficulty. The 
scaled tests, however, were purposely constructed to have 
items of unequal difficulty. Further, because of the restrict- 
ed score ranges for the scaled tests following the even-odd 
split, the largest range being eight, the coefficients would 
probably be underestimated. 
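
An odd-even split-half coefficient can be sketched as below. The text does not say whether the half-test correlation was stepped up to full test length; the Spearman-Brown correction is therefore included only as an optional, assumed step. Function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a half has no variance")
    return sxy / sqrt(sxx * syy)


def odd_even_split_half(responses, step_up=False):
    """Correlate scores on odd-numbered items with scores on even-numbered items."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r = pearson(odd, even)
    # Assumption: Spearman-Brown step-up to full length, applied only on request.
    return 2 * r / (1 + r) if step_up else r


if __name__ == "__main__":
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(odd_even_split_half(data), odd_even_split_half(data, step_up=True))
```

With so few items per half the half-scores take only a handful of values, which illustrates the restricted-range concern raised above.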

In addition to the split-half procedure, stability 
for the scaled tests was viewed in another perspective. 
Rather than stability over time, a measure of the stability 
of the scale between groups was considered, i.e., did the test 
scale for more than one sample, and, if so, was the ordering 
of the items stable between the groups? 

J. P. Guilford, Fundamental Statistics in Psychology 
and Education (third edition; New York: McGraw-Hill Book 
Company, 1956), p. 456; Robert L. Ebel, Measuring Educational 
Achievement (Englewood Cliffs, New Jersey: Prentice-Hall, 
Inc., 1965), p. 343. 

The rationale for determining the stability of the 
item orders between groups was as follows: if an order of 
items is established as scaled for a given group, and if that 
obtained scale is administered to another group for the 
purpose of inferring behavior from total score, the order of 
the items should be the same for the latter group as it 
was for the former. Such a measure was obtained in the 
present investigation. The item orders for the scaled tests 
administered to the Oakleaf and Sickman pupils, respectively, 
were obtained. These obtained orders were tested for sta- 
bility with the Spearman rho correlation for rank 
differences. 
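
The comparison of item orders can be sketched with the rank-difference formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The sketch assumes each group's scale is given as a list of item labels from easiest to hardest, with no tied positions; the function name and labels are hypothetical.

```python
def spearman_rho(order_a, order_b):
    """Rank-difference correlation between two orderings of the same items."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    if n < 2:
        raise ValueError("at least two items are required")
    # Rank of each item in the second ordering (1 = easiest).
    rank_b = {item: r for r, item in enumerate(order_b, start=1)}
    d_squared = sum(
        (rank_a - rank_b[item]) ** 2
        for rank_a, item in enumerate(order_a, start=1)
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


if __name__ == "__main__":
    school_one = ["A", "B", "C", "D", "E"]
    school_two = ["A", "C", "B", "D", "E"]  # items B and C exchange places
    print(spearman_rho(school_one, school_two))
```

A coefficient of 1 would mean the two groups produced identical item orders.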

Similarly, an order of items should remain internally 
stable for a particular group. That is, subsamples should 
maintain the same item orderings as the original sample if 
behavior inferences from total score are to be accurate. 
An assessment of this type of stability was also obtained. 
Two groups of fifty pupils each were randomly selected for 
each of the tests from the combined samples from Oakleaf 
and Sickman. The tests of the two groups were then subjected 
to the scaling methodology, separately. The obtained item 
orders for the two groups were then tested for stability 
with the Spearman rho correlation. 

The above stability procedures differ from the 
traditional approach to reliability. While the traditional 
reliability procedures attempt to evaluate the consistency 
of test scores, the stability procedures attempt to evaluate 
consistency of item orders. 

The next evaluation of the scaled tests concerned 
validity. Because of a multitude of types of validity, an 
enumeration and discussion of all types will not be presented. 
Rather, the present study will be limited to discussing the 
desired outcomes of the scaled tests and how each of these 
criteria was evaluated. 

The scaled tests were designed to represent skills 
in five areas: Addition, Subtraction, Numeration, Time Telling, 
and Concepts in Money. Tyler stated that one criterion for 
validity is how clearly the objectives have been defined, and 
how well the items represent the objectives.^84 While 
validity in this sense cannot be expressed in terms of some 
coefficient, this criterion was employed in the formulation 
of the objectives and construction of the items. The 
objectives were defined by employing three criteria suggested 
by Lindvall: 

1. The objective should be stated in terms of the 
pupil. 

2. The objectives should be stated in terms of 
observable behavior. 

3. The statement of an objective should refer to 
the behavior or process and to the specific content to 
which this is to be applied. 

The items were constructed directly from the obtained ob- 
jectives. 

The scaled tests were also designed to provide scores from which
behavior could be inferred. A measure of the tests' validity in this
respect would be the degree of success obtained when behavior was
inferred from test score. Such a measure was provided by predicting
that a total score of n for each pupil meant he had passed the first
n items. Percentages were obtained for perfect predictions,
predictions off by one item, and predictions off by more than one
item.

Ralph W. Tyler, "The Development of Instruments for Assessing
Educational Progress," Proceedings of the 1965 Invitational
Conference on Testing Problems (Princeton: [remainder of note
missing in source]
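The prediction check described above can be sketched as follows. This is a minimal illustration under a stated assumption: a paper is counted as "off by k items" when k of its first n items were failed, where n is the pupil's total score. The source does not spell out how "off by one item" was counted, so this definition, the function names, and the example papers are hypothetical.

```python
def items_off(responses):
    """responses: list of 0/1 item results in scaled (easiest-first) order.
    Returns how many of the first n items were failed, n = total score.
    ASSUMPTION: this is what 'off by k items' means."""
    n = sum(responses)
    return n - sum(responses[:n])


def prediction_percentages(papers):
    """Percent of papers predicted perfectly, off by one item,
    and off by more than one item."""
    if not papers:
        raise ValueError("no papers supplied")
    counts = {"perfect": 0, "one_off": 0, "more_off": 0}
    for responses in papers:
        off = items_off(responses)
        if off == 0:
            counts["perfect"] += 1
        elif off == 1:
            counts["one_off"] += 1
        else:
            counts["more_off"] += 1
    return {key: 100.0 * value / len(papers) for key, value in counts.items()}


# Hypothetical papers on a five-item scaled test.
example_papers = [
    [1, 1, 1, 0, 0],  # score 3, first three passed: perfect prediction
    [1, 1, 0, 1, 0],  # score 3, one of the first three failed
    [0, 0, 1, 1, 0],  # score 2, both of the first two failed
    [1, 0, 0, 0, 0],  # score 1, first item passed: perfect prediction
]
percentages = prediction_percentages(example_papers)
```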

Since the scaled tests were also designed to indicate the pupils'
standing in relation to the sequence of objectives, the behaviors
represented by the test score should be indicative of the students'
positions in the classroom curriculum. In other words, if the test
is an achievement test it should indicate how well the student is
mastering the skills in the classroom. With the students at Oakleaf,
a daily record was kept concerning mastery of skills in each unit of
study in mathematics. A comparison of the test scores to these
students' level of mastery in the respective units provides an
essential measure of validity for the present study. Predictions on
the basis of scaled test scores were made concerning the level and
skill attained by each student in each of the five units at the time
of testing. The percentages of correct predictions and of
predictions one, two, three, and more than three skills off were
obtained.

In order to make the predictions, each behavioral objective for the
five scaled tests was matched as closely as possible to a behavioral
objective of the Oakleaf mathematics curriculum sequence. The
Oakleaf objectives were available in a numbered sequence which was
arranged by unit and level. As a result each behavioral objective
from the scaled tests could be placed in a level at a skill by
matching it to the Oakleaf curriculum objective. For example, the
behavioral objective, "The student will be able to subtract two
two-digit numerals [illegible]," was found to correspond with the
Oakleaf skill [illegible] in the subtraction unit (see [illegible] C
for a description of levels A-E). Since the test items [illegible]
in the same sequential order as the objectives from the Oakleaf
curriculum (see page 101), to predict level and skill from the total
score the following procedures were employed:



1. The unit and skill corresponding to the last item passed on each
test were obtained for each pupil (this unit was called the "base
unit").

2. If the pupil had mastered all the tested skills in the base unit,
one of the following criteria was applied:

a. If he had mastered any skills in the next unit his placement was
predicted at the lowest tested skill not mastered in that unit.

b. If he had not mastered skills in the next unit his placement was
predicted at one skill above the last tested skill in the base unit.

c. If he had mastered all skills at the next unit criteria 1 or 2
was repeated for the subsequent unit.

d. For those passing all items the prediction was [text missing in
source] tested.

e. For those failing all items the prediction was made at one skill
below the lowest level and skill tested.

3. If the pupil had not mastered all the skills in the base unit the
following criteria applied:

a. If he had mastered all but one skill of the previous unit his
placement was predicted at the lowest skill not mastered in the base
unit.

b. If he had not mastered two or more skills in the previous unit
his placement was predicted at the lowest skill not mastered in that
unit.

c. If he had mastered all skills in the next unit criteria 2a, b, or
c were applied.

To determine the actual level and skill for the pupils in the five
units of the Oakleaf curriculum the following criteria were applied:

1. If the pupil worked in the unit under consideration from March
on, the last skill in which he was working was used.

2. If the unit was mastered, that level was compared to the level
where he was placed at the beginning of the following school year.
The higher of the two was taken.

3. If the pupil had not worked in a unit since March, but did work
in the unit during the year, this level was compared to his
placement as in (2) and the higher was taken.

4. If the pupil had not worked in the unit at all during the year,
his placement for the following year was taken.

The final evaluation of the scaled test was concerned with item
analysis procedures. One form of item analysis concerns the
discriminating ability of the items. Traditionally, items have been
selected at the 50 percent level of difficulty when it was desired
that items make the maximum number of discriminations between those
passing and failing the items. Such procedures are not appropriate
for scaled tests, however, for in evaluation the concern is not
always with maximum discrimination. Sometimes it is desired to know
what all the students answer or what only a few can answer, or in
the case of criterion-referenced measures, what each student has
mastered in terms of absolute standards. As Tyler states:

     Tests that are constructed to measure individual difference
     contain a very large proportion of items that are at the 50
     percent level of difficulty because these are the most
     efficient in discrimination. Such tests are inappropriate for
     the assessment because they do not furnish an adequate picture
     of what is being learned by nearly all and by the most
     advanced.

Other item analysis procedures commonly employed are the selection
of items which have the highest correlation with the other items in
the test, or items which have the highest correlation with the total
test score. The former is especially sensitive to item difficulty,
functioning best when the items are of equal difficulty. Therefore,
this procedure was deemed inappropriate for scaled tests, since the
scaled tests were constructed to have items of unequal difficulty.

Item-total score correlations, while not as sensitive to difficulty
as item correlations, also do not afford the type of procedure
necessary for scaled tests. Items which are good items for a scaled
test may be judged poor by the item-total score procedure. For
example, the items at the extremes of the scales, which most pupils
answer correctly or incorrectly, will have low item-total score
correlations and be rejected by this procedure. These items,
however, may be crucial in defining a scale.

Rather than selecting items which best represent the total score,
the appropriate item selection procedure for scaled tests should
yield those items which best represent a scale. If each item should
represent the scale then each item should have a reproducibility of
.90 or greater, i.e., if there are 100 scores the maximum errors for
an item should be ten. Consider, for example, a score of 4 obtained
on a scaled test. Theoretically, all those individuals with a score
of 4 or above should have passed item four. All those with a score
of 3 or below should have failed item four. The errors for item four
can be found by summing the number of people who pass item four who
score less than 4 and the number of people who fail item four who
score equal to or greater than 4. This procedure was employed for
identifying poor items in each of the five tests for Oakleaf,
Sickman, and the tests based on the combined samples.
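The error count for a single item follows directly from the rule just stated, and can be sketched as below. The function names and the example papers are hypothetical; only the counting rule and the .90 criterion come from the text.

```python
def item_errors(papers, item_number):
    """Count scale errors for one item (1-based position in the scale).
    An error is a pupil who passes the item with a total score below
    item_number, or fails it with a total score at or above item_number."""
    errors = 0
    for responses in papers:
        score = sum(responses)
        passed = responses[item_number - 1] == 1
        if passed and score < item_number:
            errors += 1
        elif not passed and score >= item_number:
            errors += 1
    return errors


def item_reproducibility(papers, item_number):
    """Proportion of pupils whose result on the item agrees with the scale."""
    return 1 - item_errors(papers, item_number) / len(papers)


# Hypothetical papers on a five-item scaled test (1 = pass, 0 = fail).
example_papers = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
]
errors_item_three = item_errors(example_papers, 3)
poor_item = item_reproducibility(example_papers, 3) < 0.90
```

In this small example item three is failed by one pupil scoring 3 and passed by one pupil scoring 2, so it carries two errors in four papers and would be flagged as a poor item under the .90 criterion.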









[Text illegible in source] comparison of the scaled tests and a
norm-referenced, standardized test. Since [text partly illegible in
source] the standardized achievement test, it was desirable to know
how the results of the two types of tests compared. The Metropolitan
Achievement Tests, Primary I (Form B), Primary II (Form A), and
Elementary (Form C) batteries were administered to the Oakleaf
pupils concurrently with the scaled tests. The comparison involved
the following procedures:



1. The methods employed for the evaluation of the Metropolitan tests
were applied to the scaled tests. This consisted of split-half
reliability and item analysis based on a discrimination index
between high and low scorers. The sizes of the reliability
coefficients reported in the Metropolitan examiner's manuals were
compared to the coefficients obtained from the scaled tests, taking
into consideration the contaminating factors mentioned previously.
The Metropolitan manuals were not specific concerning the type of
discrimination index employed. Therefore, the difference between the
upper twenty-seven percent and lower twenty-seven percent was
chosen. The items rejected by this procedure were compared to those
rejected by the suggested procedure for scaled tests.
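A discrimination index of the kind named in procedure 1 can be sketched as the difference in the proportion passing an item between the top and bottom 27 percent of scorers. The function name, the rounding of the group size, and the example papers are assumptions made for illustration; the source does not state how group sizes were rounded or what index value led to rejection.

```python
def discrimination_index(papers, item_number, fraction=0.27):
    """Proportion passing the item in the upper group minus the
    proportion passing it in the lower group.
    ASSUMPTION: group size is fraction * N rounded to the nearest pupil."""
    ranked = sorted(papers, key=sum, reverse=True)  # highest totals first
    group_size = max(1, round(fraction * len(ranked)))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(r[item_number - 1] for r in upper) / group_size
    p_lower = sum(r[item_number - 1] for r in lower) / group_size
    return p_upper - p_lower


# Ten hypothetical papers on a four-item test.
example_papers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
d_item_one = discrimination_index(example_papers, 1)
d_item_three = discrimination_index(example_papers, 3)
```

The easiest item (item one) discriminates less than a middle item here, which is the property that makes this index a poor selection criterion for items at the extremes of a scale.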

2. An attempt was made to apply the scaling methodology to the items
of the Arithmetic Computation, Concepts and Problem Solving subtests
of the Metropolitan. [Text illegible in source.] These items and
objectives [illegible] tests corresponding to the five scaled tests:
Addition, Subtraction, Numeration, Time Telling, and Concepts in
Money. Items from the Metropolitan which did not fall into one of
the above five categories were [illegible]. The same criteria for
scalability which applied to the scaled tests were applied to these
five derived tests.

3. To determine if a particular raw score represented the same items
being answered correctly, the raw score patterns for the scaled
tests and the derived Metropolitan arithmetic tests were obtained.
For the Metropolitan the contrived items were reduced to their
corresponding individual items since the comparison involved raw
scores. The raw score patterns for the five tests were compared to
the raw score patterns of the scaled tests by computing the average
number of score patterns per test.

The procedure for computing the average number of score patterns per
test was as follows: [illegible] zero scores, or scores [illegible]
were omitted because only one score pattern was possible. The
averages were obtained by totaling the number of score patterns for
a test and dividing by the smaller of the following two values: the
total number of people represented by the patterns or the maximum
number of patterns. The latter value was necessary because [text
missing in source]. The maximum number of patterns was obtained by
finding the combination of n things taken r at a time. For example,
if a test had only three items the maximum number of patterns would
be six, the scores of 2 and 1 being the only scores considered.
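The maximum number of score patterns in the three-item example can be checked by summing the combinations of n items taken r at a time over the scores considered. The sketch below assumes that only zero and perfect scores are excluded, which is what the three-item example implies; the function names and the example papers are hypothetical.

```python
from math import comb


def max_score_patterns(n_items):
    """Number of distinct pass/fail patterns over all scores r with
    0 < r < n_items, i.e. the sum of C(n_items, r).
    ASSUMPTION: only zero and perfect scores are excluded."""
    return sum(comb(n_items, r) for r in range(1, n_items))


def observed_score_patterns(papers):
    """Number of distinct patterns actually produced, ignoring zero
    and perfect papers."""
    n_items = len(papers[0])
    return len({tuple(r) for r in papers if 0 < sum(r) < n_items})


three_item_maximum = max_score_patterns(3)

# Hypothetical papers on a three-item test.
example_papers = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1]]
observed = observed_score_patterns(example_papers)
```

For a perfectly scaled test each score would be produced by exactly one pattern, so the number of observed patterns would equal the number of distinct scores considered.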

4. The score patterns of the scaled tests and the score patterns of
the derived Metropolitan scaled tests were compared as in part 3
above.

Predictions from the total raw scores of the derived Metropolitan
tests were made concerning the level and skill of each pupil in each
of the five units. Since the predictions were based on raw scores,
the contrived items were again reduced to their components. With one
exception the prediction followed the same procedures and criteria
employed in the validation of the scaled tests. The exception
involved the determination of score ranges for the Metropolitan
tests. Since no contrived items were employed, many items of the
Metropolitan tests pertained to the same objective. Thus a range of
scores, corresponding to the number of items testing a given
objective, was established for the purpose of predicting the pupils'
positions.

The methodology for the construction of sequentially scaled
achievement tests was applied in the development of tests in five
areas of arithmetic achievement: Addition, Subtraction, Numeration,
Time Telling, and Concepts in Money. Subjects were obtained from two
elementary schools, Oakleaf and Sickman, in suburban Pittsburgh.
Following the construction of the tests, three phases of evaluation
were attempted: reliability, validity, and item analysis. The
following procedures were employed:

1. For reliability (a) the split-half (even-odd) correlation was
obtained as an approximation to test-retest reliability, and (b)
Spearman Rho rank order correlations were obtained for the orderings
of two random samples of pupils selected from the combined Oakleaf
and Sickman populations. Rank order correlations were also obtained
for the item orderings of the Oakleaf tests versus the corresponding
Sickman tests. These provided a measure of the stability of the
items in the scales.

2. The essential measure of validity was provided by the agreement
between the scaled scores and the Oakleaf [text illegible in source]
student's level of mastery in each of the five units. Predictions
were made, on the basis of the scaled test scores, [text illegible
in source] in each of the five [text missing in source]. The
validity of each scaled test was also obtained by determining the
percentage of times a total score of n represented a student's
passing the first n items. Perfect representation, those one item
off, and those more than one item off were obtained.

3. The item analysis procedure involved establishing a minimum
reproducibility of .90 for each item. The maximum number of errors
for an item was ten percent of the number of people. Errors were
obtained by counting those persons having a score lower than the
item number but passing the item, and those persons obtaining a
score equal to or greater than the item number but failing the item.

The final phase of the present investigation involved a comparison
of the scaled tests' results to the results of the Metropolitan
Achievement Tests administered concurrently to the Oakleaf pupils.
The comparison involved the following procedures:

1. The methods employed for the evaluation of the Metropolitan Tests
were applied to the scaled tests. Reliability and item analysis
procedures were compared.

2. An attempt was made to determine whether the scaling methodology
could be applied to the items of the arithmetic computation subtest
of the Metropolitan [text missing in source].

3. [Text largely illegible in source.] The scaled tests and the
derived Metropolitan tests were compared by number of patterns for
individual and number of patterns for test.

4. The raw scores of the five tests derived from the Metropolitan
were used to infer behavior from total score. The predictions and
percentages were obtained in the same manner as the scaled tests.
The resultant percentages provided a measure of the relative success
of both tests.

[Text missing in source] the scaling methodology for the present
investigation [text missing in source]:

1. Identification of the terminal behavioral objective to be tested,
followed by identification of a series of behavioral objectives
which appear logically to precede the terminal objective in a
sequence.

2. Construction of items corresponding to each objective.

3. Combination of the items into "contrived items."

4. Establishment of a criterion for passing each "contrived item."

5. Administration and scoring of the tests.

6. Application of Guttman's "scalogram analysis" technique including
computation of the reproducibility coefficient.

7. Computation of a coefficient of scalability.

For both the Oakleaf and Sickman tests an initial reproducibility
coefficient was calculated. The items were



then rearranged, where necessary, to obtain the maximum coefficient
of reproducibility. The coefficient of scalability was computed
following the final arrangement of the items (see Appendix
[illegible] for a sample scalogram). The obtained reproducibility
and scalability coefficients are presented in Tables 1 and 2.

[Table fragment. The title is lost in the source; the values match
the Oakleaf results discussed in the text. Column headings other
than "Revised Scal." are inferred from the text and from Table 3.]

               Initial                         Revised    Revised
Test           Repro.      Repro.    Scal.     Repro.     Scal.

Subtraction    [illeg.]    .966      .815      .961       .791
Addition       .961        .979      .854      .977       .825
Numeration     .957        .965      .682      .947       .647
Money          .943        .953      .690      .951       .676
Time           .907        .929      .711      .920       .676

[Text missing in source] for revised reproducibility and revised
scalability coefficients. These revisions were made because of the
number of perfect scores occurring. With a perfect score the
reproducibility has to be perfect and, therefore, a group of perfect
scores will spuriously raise the reproducibility of the test as a
whole. While the coefficient of scalability will insure against
spuriousness in the test, it too must be perfect for all perfect
scores. Therefore, all but one perfect paper was omitted from each
test, the one perfect score remaining to give the test a ceiling.
Reproducibility and scalability coefficients were again computed,
and these coefficients constituted the revised coefficients
appearing in the tables. These revised coefficients were established
as the criteria for accepting or rejecting the tests as scales.
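The effect of perfect papers on the reproducibility coefficient can be illustrated with a short sketch. It counts errors against the ideal pattern for each total score, which agrees with the item error rule given in the item analysis procedure, and covers reproducibility only; the coefficient of scalability is not reproduced because its formula is not given in this section. The function names and example papers are hypothetical.

```python
def reproducibility(papers):
    """1 - (errors / total responses), where an error is any response
    that departs from the ideal pattern for the pupil's total score
    (first `score` items passed, the rest failed)."""
    n_items = len(papers[0])
    errors = 0
    for responses in papers:
        score = sum(responses)
        ideal = [1] * score + [0] * (n_items - score)
        errors += sum(a != b for a, b in zip(responses, ideal))
    return 1 - errors / (len(papers) * n_items)


def drop_extra_perfect_papers(papers):
    """Keep a single perfect paper (to give the test a ceiling) and
    omit the rest, as in the revised coefficients."""
    n_items = len(papers[0])
    kept, perfect_seen = [], False
    for responses in papers:
        if sum(responses) == n_items:
            if perfect_seen:
                continue
            perfect_seen = True
        kept.append(responses)
    return kept


# Hypothetical four-item test: six perfect papers and three others.
example_papers = [[1, 1, 1, 1]] * 6 + [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 0]]
original = reproducibility(example_papers)
revised = reproducibility(drop_extra_perfect_papers(example_papers))
```

With the same four errors spread over fewer responses, the revised coefficient is lower than the original, which is the spurious inflation the revision was meant to remove.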

From the values presented in Tables 1 and 2, four of the five
Oakleaf tests and four of the five Sickman tests met the criteria
for being a scale. Because the scalability coefficient was below .65
the Oakleaf Numeration test failed to meet the criteria and was not
considered scaled for this group. The Oakleaf Money and Time tests
barely reached the minimum value for scales, and while accepted as
scaled, were in need of further revision. The omission of some poor
items (to be discussed later) and/or the addition of new items is
suggested.

The Sickman Time test also failed to meet the scalability criterion
and was not considered scaled for the Sickman pupils. The remaining
four tests had substantial scalability coefficients in contrast to
only two substantial scalability coefficients for the Oakleaf tests.
The difference between the Oakleaf and Sickman Numeration tests may
be a function of differential familiarity with Numeration content.
The lowest score on the Oakleaf Numeration test was 7 while there
were nineteen scores below 7 for the Sickman Numeration test. It
would appear that the Oakleaf students had a greater familiarity
with the Numeration objectives tested, the test being too easy for
them. As a result the objectives and corresponding items were not
sequenced once the students had become familiar with them. The
Sickman pupils who scored lower than 7 evidently had not had the
same amount of familiarity with the Numeration objectives, and could
not attempt the higher level items. Consequently, the opportunity to
have high reproducibility and scalability was greater for these
items.

The same conjecture could not apply to the Money test, however. The
score ranges were almost the same for both schools. Perhaps the best
suggestion is that the Money test functions differently for the two
groups, and that it best represents the ordering of the objectives
of the Sickman curriculum.

Following the attempt to scale the tests for each school separately,
the two samples of pupils were combined to determine if the tests
would scale for the entire group. The reproducibility and
scalability coefficients are reported in Table 3 for the final
ordering of items. Revised reproducibility and scalability
coefficients were again calculated by omitting all but one perfect
paper from each test. These revised coefficients were employed as
the criteria for considering a test as scaled. The results show that
only the Time test failed to meet the criteria and was not
considered scaled. The magnitude of the scalability coefficients for
the Numeration and Money tests was undoubtedly enhanced by the
Sickman scores.



TABLE 3

REPRODUCIBILITY AND SCALABILITY COEFFICIENTS
FOR THE COMBINED SAMPLE OF PUPILS

                                       Revised    Revised
Test           Repro.    Scalability   Repro.     Scal.

Subtraction    .955      .778          .946       .739
Addition       .973      .837          .916       .776
Numeration     .962      .715          .943       .715
Money          .958      .733          .955       .712
Time           .917      .669          .907       .630






B. Reliability of the Scaled Tests

Having applied the methodology for the construction of scaled tests,
the resultant five tests were subjected to evaluation procedures in
the areas of reliability, validity, and item analysis. The initial
measure of reliability involved the computation of split-half
reliability coefficients to afford an indication of the stability of
the test scores. Because the final orders of items of each test were
arranged in ascending order of difficulty an even-odd split was
chosen. To afford an adequate sample size the correlations were
computed from the combined Oakleaf and Sickman test results. The
correlations, appearing in Table 4, were corrected by using the
Spearman-Brown formula.
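The even-odd split and the Spearman-Brown correction can be sketched as follows. The correction for a test split into two halves is r_tt = 2r / (1 + r), where r is the correlation between the half scores. The function names and the example papers are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(papers):
    """Correlate odd-numbered and even-numbered item totals, then
    apply the Spearman-Brown correction."""
    odd_scores = [sum(r[0::2]) for r in papers]   # items 1, 3, 5, ...
    even_scores = [sum(r[1::2]) for r in papers]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_scores, even_scores))


# Hypothetical papers on a four-item test ordered by difficulty.
example_papers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
corrected = split_half_reliability(example_papers)
```

Because the items are ordered by difficulty, the even-odd split keeps the two halves roughly comparable in difficulty, which a first-half versus second-half split would not.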



TABLE 4

CORRECTED SPLIT-HALF RELIABILITY
COEFFICIENTS FOR THE SCALED TESTS

Test           r_tt

Subtraction    .872
Addition       .908
Numeration     .931
Money          .852
Time           .787



From the magnitudes of the split-half coefficients the tests appear
to have fairly stable scores. The Time test had the lowest
coefficient, a result which indicated some lack of stability but a
result not unexpected since the Time test did not meet the scale
criteria. When the contaminating factors of the split-half
correlation are considered, the above results are most encouraging,
for with varying difficulties and restricted score ranges the
coefficients may be underestimated.

While the split-half reliability afforded a measure of the stability
of the scores, a measure of the stability of the item orderings was
also computed. Two assessments of the stability of the tests' item
orders were employed: (1) the stability of the item orderings
between the two schools and (2) the stability of the item orderings
for two randomly selected groups from the combined samples. The
Spearman Rho rank-difference correlation was employed for both
assessments.

The Spearman correlation coefficients for the item orders between
schools appear in Table 5. The Subtraction and Addition test items
had stable item orders. The Time and Money item orders were not as
stable and indicated some fluctuation. A coefficient as large as
.880 was unexpected for the Time test. The order of items remained
relatively stable even though they did not scale. The Numeration
item orders were not stable. This result, however, reinforces the
conjecture concerning this test: the items did not have a scalable
ordering for Oakleaf but did for Sickman where they were more
difficult.

TABLE 5

SPEARMAN RHO CORRELATION COEFFICIENTS FOR
OAKLEAF AND SICKMAN ITEM ORDERS

Test           Rho

Subtraction    .946
Addition       .932
Numeration     .736
Money          .846
Time           .880



A second assessment of the stability of the items was made by
selecting two random samples of fifty pupils from the tests based on
the combined samples of Oakleaf and Sickman. Spearman Rho
correlations were obtained for the two samples and are presented in
Table 6. The results were that the item orderings are stable within
any of the five combined tests.

TABLE 6

SPEARMAN RHO CORRELATION COEFFICIENTS FOR
ITEM ORDERS OF TWO RANDOM SAMPLES FROM
THE TESTS BASED ON COMBINED SAMPLES

Test           Rho

Subtraction    [illegible]
Addition       .982
Numeration     .930
Money          .930
Time           .932


C. Validity of the Scaled Tests

The second phase in the evaluation of the scaled tests concerned
validity. One of the validation procedures was the measure of the
degree of success obtained when behavior was inferred from the total
raw score of the scaled tests. Theoretically, a pupil obtaining a
score of n should have answered only the first n items correctly.
Percentages for each test were obtained for the number of times the
total score equaled the first n items passed. Percentages were also
obtained for the number of times the total score was off by one item
or more than one item in representing the first n items passed.
Perfect scores were omitted from the calculations since those
students had not reached a point at which an [text missing in
source]. The resultant percentages for Oakleaf are presented in
Table 7. The Sickman results are presented in Table 8. After the two
schools' samples were combined, the item orders were reanalyzed. The
results for the combined samples are presented in Table 9. The
number in parentheses following the title indicates the number of
scores employed in the calculations.



TABLE 7

PERCENTAGES OF RAW SCORES EQUALING THE
FIRST n ITEMS PASSED (OAKLEAF)

                      Percent    Percent         Percent Greater
Test                  Perfect    One Item Off    Than One Item Off

Subtraction (77)      59.7       37.7            2.6
Addition (72)         70.8       25.0            4.2
Numeration (54)       33.3       57.4            9.3
Money (79)            [illeg.]   32.9            12.7
Time (74)             17.5       45.9            36.5







TABLE 8

PERCENTAGES OF RAW SCORES EQUALING THE
FIRST n ITEMS PASSED (SICKMAN)

                      Percent    Percent         Percent Greater
Test                  Perfect    One Item Off    Than One Item Off

Subtraction (57)      61.4       31.6            [illeg.]
Addition (50)         40.0       [illeg.]        10.0
Numeration (47)       42.6       46.8            10.6
Money (67)            64.3       28.3            7.4
Time (70)             17.1       34.3            48.6

TABLE 9

PERCENTAGES OF RAW SCORES EQUALING THE
FIRST n ITEMS PASSED (COMBINED)

                      Percent    Percent         Percent Greater
Test                  Perfect    One Item Off    Than One Item Off

Subtraction (134)     45.6       50.0            4.5
Addition (122)        57.4       36.9            5.7
Numeration (101)      30.7       58.4            10.9
Money (146)           56.8       34.2            8.9
Time (144)            14.6       32.6            52.8



The results showed that for every test but Time the total scores
were within a maximum of one item off in having the total represent
the first n items passed for approximately ninety percent of the
cases. These percentages are closely related to the reproducibility
of the tests, since the errors employed in determining the
reproducibility coefficients were the same errors made when the
total score did not equal the first n items passed.

The Time test, having the lowest reproducibility in all cases, was
also the poorest in this phase of validity evaluation. The scores
did not come within one item of equaling the first n items passed in
more than fifty percent of the cases for Sickman, and more than
sixty percent for Oakleaf. When the schools were combined the Time
scores were successful in being within one item in just slightly
more than fifty percent of the cases.

The final assessment of the validity of the scaled tests, and
perhaps the most crucial for the present investigation, was how well
the tests' results indicated the position of the pupils in the
classroom curriculum sequence. In order to determine this type of
validity a daily record of the pupils' positions by skill and level
in each unit was required. Such a record was maintained for the
Oakleaf pupils but not for the Sickman pupils. The results,
therefore, pertained only to the Oakleaf tests. On the basis of raw
score alone, and following the procedures outlined in the previous
chapter, predictions were made for each pupil concerning the level
and skill at which he should be working in each of the five tested
areas.

Percentages were obtained for the number of times the pupil was
placed at the exact level and skill or at one, two, three or more
skills off. The obtained percentages are shown in Table 10. The
number in parentheses beside each test name indicates the number of
pupils who could be placed with a degree of accuracy. In a few
instances pupils were no longer in the school and no placement data
for 1965 were available.

[TABLE 10: percentages of Oakleaf pupils placed at the exact level
and skill, or one, two, three, or more skills off, for each of the
five scaled tests. The table is not recoverable from the source.]

The results indicated that in four of the five tests at least eighty
percent of the Oakleaf pupils were placed within a maximum of three
skills from their position in the curriculum sequence. Approximately
one third to two thirds of the pupils, depending on the test, were
placed at a maximum of one skill off.

The validity of the Addition test was poor, especially in comparison
to the other tests. Since the Addition test had ranked so highly
with respect to the other evaluation criteria, the result was
surprising. The result should have been expected, however, when
viewed with the structure of the curriculum sequences for the five
tested areas. The information provided in Table 11 indicated the
number of skills in the Oakleaf curriculum at each of the five
levels tested in each unit. Table 12 furnished the ratios of the
number of items in each respective test to the number of skills in
the last three levels (C, D, and E). The reason for eliminating
levels A and B was that when the predictions were made to determine
the validity of the tests, only twenty-one of the 416 cases were
actually in level B and none were in level A.

TABLE 11

NUMBER OF CURRICULUM SKILLS PER [title truncated in source]

                         Levels
Tested Units     A     B     C     D     E     Total

Subtraction      2     6     2     3     3     16
Addition         2     8     6     5     7     28
Numeration       10    9     4     2     3     28
Money            1     3     2     5     3     14
Time             3     4     7     3     2     19


TABLE 12

RATIOS OF NUMBER OF TEST ITEMS TO NUMBER
OF CURRICULUM SKILLS IN LEVELS C, D, and E

                                Total     Number of
Tests          C    D    E      Skills    Test Items    Items/Skills

Subtraction    2    3    3         8         11            1.38
Addition       6    5    7        18         14             .78
Numeration     4    2    3         9         14            1.56
Money          2    5    3        10         12            1.20
Time           7    3    2        12         16            1.33

From the results of these two tables the Addition
test should have been the least valid; it employed slightly
more than half the number of items as the other tests to
evaluate a skill. In other words a smaller number of items
was employed to test a wider range of skills. While, on
the average, each skill was tested with an item on each of
the other tests, more than one skill was tested with an item
on the Addition test.
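
The items-per-skill ratios can be recomputed directly from the skill counts of Table 11. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the item counts (11, 14, 14, 12, and 16) are assumed to be the numbers of items appearing for each scaled test in the item analysis tables.

```python
# Skills per level (A through E) for each tested unit, as given in Table 11.
SKILLS = {
    "Subtraction": {"A": 2, "B": 6, "C": 2, "D": 3, "E": 3},
    "Addition": {"A": 2, "B": 8, "C": 6, "D": 5, "E": 7},
    "Numeration": {"A": 10, "B": 9, "C": 4, "D": 2, "E": 3},
    "Money": {"A": 1, "B": 3, "C": 2, "D": 5, "E": 3},
    "Time": {"A": 3, "B": 4, "C": 7, "D": 3, "E": 2},
}

# Number of items in each scaled test (assumed from the item analysis tables).
ITEMS = {"Subtraction": 11, "Addition": 14, "Numeration": 14,
         "Money": 12, "Time": 16}


def items_per_skill(test, levels=("C", "D", "E")):
    """Ratio of test items to curriculum skills in the chosen levels."""
    skills = sum(SKILLS[test][level] for level in levels)
    return ITEMS[test] / skills


ratios = {test: round(items_per_skill(test), 2) for test in SKILLS}
lowest = min(ratios, key=ratios.get)  # the test with the fewest items per skill
```

Under these assumptions the Addition test is the only one with fewer than one item per skill, which is the point made in the paragraph above.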



D. Item Analysis of the Scaled Tests 

The final evaluation procedure suggested for the
scaled tests involved item analysis. The scalogram analysis
procedure actually included an item analysis technique.
Since the minimum reproducibility for a test was .90, the
criterion adopted for the analysis of the items was that each
item should have a minimum reproducibility of .90. Rather
than computing reproducibility coefficients for all the items,
the maximum number of errors for an item was found. An
error was counted when (1) a correct response was made by an
individual to an item whose number was greater than the total
score obtained by that individual, and (2) an incorrect
response was made to an item whose number was lower than the
total score of the individual. Any item whose errors exceeded
the maximum number of errors was considered as a poor item.
The procedure was applied to the Oakleaf, Sickman, and combined
tests. The results are shown in Tables 13, 14, and 15. The
number in parentheses after each test name indicates the
maximum number of errors for the items of that test. An
item number superscribed with a bar signifies a poor item.
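
The error count defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: responses are coded 1 (correct) and 0 (incorrect), items are numbered in scale order from easiest to hardest, and the item whose number equals the pupil's total score is treated as one that should have been passed, a boundary the wording above leaves open. The item reproducibility is taken in the usual scalogram form, one minus errors divided by the number of pupils. The function names and the sample data are hypothetical.

```python
def item_errors(responses):
    """Count scale errors per item.

    `responses` holds one row per pupil; each row lists 0/1 answers
    with the items already arranged in scale order (an assumption).
    """
    n_items = len(responses[0])
    errors = [0] * n_items
    for row in responses:
        score = sum(row)
        for index, answer in enumerate(row):
            item_no = index + 1
            # A perfect scale pattern passes items 1..score and fails the rest.
            expected = 1 if item_no <= score else 0
            if answer != expected:
                errors[index] += 1
    return errors


def poor_items(errors, max_errors):
    """Item numbers whose error count exceeds the allowed maximum."""
    return [i + 1 for i, count in enumerate(errors) if count > max_errors]


def item_reproducibility(errors, n_pupils):
    """Reproducibility of each item: 1 - errors / number of pupils."""
    return [1 - count / n_pupils for count in errors]
```

With roughly eighty pupils, for instance, a minimum item reproducibility of .90 would correspond to a maximum of eight errors, the figure shown in parentheses beside several of the Oakleaf tests.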



TABLE 13

ITEM ANALYSIS FOR OAKLEAF SCALED TESTS

Subtraction (8)

Item No.   1   2   3   4   5   6*  7   8   9*  10  11
Errors     2   1   7   8   4   ?   4   6   10  8   4

Addition (8)

Item No.   1   2   3   4   5   6   7   8   9   10  11  12  13  14
Errors     0   0   1   0   2   4   4   4   5   7   8   5   5   3



Numeration 



Item No. 
Errors 



Item No. 
Errors 



1 2 

iQ Q 



1 2 
1 4 



3 

1 



3 

8 



4 

I 

4 

5 



5 

3 



5 

f 



e m 7 v8 ; 8 10 TT TI II I? 

... ^ ^ 

4 4 3 « 19 10 10 10 20 

Honey (8) 

5 ;, It; ff .? Iff 11 12 

$ 15 19 11 14 4 ro i 



Time (8)

Item No.   1   2   3   4*  5*  6*  7*  8   9*  10* 11* 12* 13* 14* 15* 16*
Errors     1   3   8   13  17  17  15  5   9   18  15  25  12  10  12  12

*An item number marked with an asterisk (superscribed with
a bar in the original) signifies a poor item.
TABLE 14

ITEM ANALYSIS FOR SICKMAN SCALED TESTS

Subtraction (7)

Item No.   1   2   3   4   5   6   7   8   9   10  11*
Errors     2   5   7   4   5   6   3   3   7   4   8

Addition (7)

Item No.   1   2   3   4   5   6   7   8   9   10  11* 12  13  14
Errors     0   0   0   4   3   5   6   6   6   7   19  1   5   7

Numeration (7)

Item No.   1   2   3   4   5   6*  7   8   9   10  11  12  13  14
Errors     1   6   7   2   4   9   3   5   6   4   5   1   3   3

Money (7)

Item No.   1   2   3   4   5   6   7   8*  9*  10* 11  12
Errors     ?   ?   3   7   2   7   6   8   9   9   5   0

Time (7)

Item No.   1   2   3   4*  5*  6*  7*  8*  9*  10* 11* 12* 13* 14  15* 16
Errors     ?   ?   6   ?   12  17  15  12  13  25  30  28  11  6   8   5

*An item number marked with an asterisk (superscribed with
a bar in the original) signifies a poor item.

TABLE 15

ITEM ANALYSIS FOR COMBINED SCALED TESTS

Subtraction (16)

Item No.   1   2   3   4   5*  6   7   8   9*  10* 11*
Errors     4   6   14  13  17  13  7   9   38  35  18

Addition (15)

Item No.   1   2   3   4   5   6   7   8   9   10  11* 12  13  14
Errors     0   ?   ?   ?   5   11  10  10  10  13  27  6   11  10

Numeration (15)

Item No.   1   2   3   4*  5   6   7   8   9   10* 11  12  13  14*
Errors     0   9   8   17  4   12  15  10  10  16  14  12  14  23

Money (15)

Item No.   1   2   3   4   5   6*  7*  8*  9*  10* 11  12
Errors     3   10  9   7   9   19  19  21  31  23  9   0

Time (16)

Item No.   1   2   3   4*  5*  6*  7*  8*  9*  10* 11* 12* 13* 14  15* 16*
Errors     ?   6   14  26  28  29  36  41  33  43  48  47  23  16  17  18

*An item number marked with an asterisk (superscribed with
a bar in the original) signifies a poor item.




Only the Oakleaf Addition test contained no poor
items in this analysis. The Addition test, as a whole,
had good items; only one item was poor for both the Sickman
and combined tests. The Sickman Subtraction and Numeration



tests wewisthOi^oiiljyf^sitli^i?^ testst.:toiilMiiii<^i^ tOUP .poor 

^4^m#'^>d8heL: ygi eii ^ wp ^' it^ii|i,:^^biat.,L#ie 

r. 
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ppoic outnuibtr«4 %htrgop4 itmi thxmm to ont* 

To determine if the poor items were the same or
variable for the Oakleaf, Sickman and combined tests the
items were renumbered according to their original order,
the order in which they appeared when the tests were
administered. The results are presented in Table 16.



TABLE 16

ORIGINAL ITEM ORDERS FOR POOR ITEMS

Tests                     Original Order

Subtraction Oakleaf       5  ?
Subtraction Sickman       5
Subtraction Combined      5  6  7  11

Addition Oakleaf          No poor items
Addition Sickman          9
Addition Combined         ?

Numeration Oakleaf        12  13  14
Numeration Sickman        8
Numeration Combined       11  12

Money Oakleaf             5  7  8  9  10
Money Sickman             5  8  10
Money Combined            ?

Time Oakleaf              ?  5  ?  7  8  9  10  11  13  14  15  16
Time Sickman              4  5  6  7  8  9  10  11  12  13  15
Time Combined             4  5  ?  7  8  9  10  11  12  13  15  16









There was variation, however,
between the schools, for the poor items were never all alike
for any of the tests.

The Numeration test represented the greatest amount
of variability with seven of eight poor items being different
for the three test situations. The results also showed
that four items were poor for Oakleaf and only one item was
poor for Sickman. This affords further information
concerning why the Sickman Numeration test was scaled and the
Oakleaf test was not.

E. Comparison of Scaled Tests and Metropolitan Achievement Tests

Following the evaluation of the scaled tests,
certain comparisons were made between these tests and the
Metropolitan Achievement Tests. The comparisons were
designed to investigate similarities and differences in the
evaluation procedures and results of the two types of tests.
The comparisons included the following:

1. The initial comparison was an attempt to apply to
the scaled tests the procedures employed in the evaluation
of the Metropolitan tests. The examiner's manual of the
Metropolitan described two evaluation procedures: (a) split-half
reliability and (b) item analysis involving a discrimination
index between high and low scorers. The split-half
coefficients for the scaled tests, obtained previously, are
presented in Table 17 along with the coefficients for the
arithmetic subtests of the Metropolitan test batteries. The
latter were taken from the examiner's manuals and are
expressed as ranges of coefficients.

TABLE 17

SPLIT-HALF RELIABILITY COEFFICIENTS FOR
SCALED TESTS AND METROPOLITAN
ACHIEVEMENT TESTS

Test                                  rtt

Scale Subtraction                     .872
Scale Addition                        .808
Scale Numeration                      .931
Scale Money                           .852
Scale Time                            .787
Metro. I Arith. Concepts              .81 - .89
Metro. I Arith. Skills                .94 - .95
Metro. II Arith. Concepts             .80 - .87
Metro. II Computation                 .74 - .88
Metro. III Arith. Concepts            .86 - .91
Metro. III Arith. Computation         .91 - .93

The coefficient values for both types of tests are
similar. Of the two contaminating factors for split-half
coefficients, only the range of difficulty was a factor for
the Metropolitan, and this factor undoubtedly not to the
fullest extent since easy items for all pupils and difficult
items for all pupils would have been rejected in the item
analysis. Therefore, the scaled test coefficients may be
underestimated to a greater extent than the Metropolitan
coefficients.
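
For readers unfamiliar with the split-half procedure, a minimal sketch follows. It rests on assumptions not stated in the text: the halves are formed from odd-numbered and even-numbered items, and the half-test correlation is stepped up with the Spearman-Brown formula. Whether the coefficients reported here were obtained in exactly this way is not known; the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a half-test has zero variance")
    return sxy / sqrt(sxx * syy)


def split_half(responses):
    """Return (half-test correlation, Spearman-Brown full-test estimate).

    `responses` holds one row of 0/1 item scores per pupil. The odd-even
    split is an assumption made for this illustration.
    """
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, 2 * r_half / (1 + r_half)
```

In a scaled test the items run from very easy to very difficult, so the two halves of any split differ in difficulty; this is the range-of-difficulty factor referred to above.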

The second evaluation procedure, item analysis,
was compared by applying the method used in the evaluation
of the Metropolitan to the Oakleaf scaled tests. The
manuals for the Metropolitan were not specific concerning the
particular high-low discrimination index employed nor the
criteria established for rejection of an item. Therefore,
percent difference was obtained and items with a difference
of thirty percent or less were rejected. The item analysis
is presented in Table 18. Items superscribed with bars
are poor items. The number in parentheses after the test
name indicates the number of persons in the upper and lower
groups, respectively. The similarity of this item analysis
procedure to the scaled test item analysis procedure can
be obtained from Table 19.
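
The percent-difference index described above can be sketched as follows. The sketch assumes upper and lower groups formed from the highest and lowest twenty-seven percent of total scores, with ties at the group boundary left in the order given; the function name and parameters are hypothetical.

```python
def upper_lower_analysis(responses, fraction=0.27, reject_at_or_below=30.0):
    """Percent-difference discrimination index for each item.

    `responses` holds one row of 0/1 item scores per pupil. Returns a
    list of (item number, percent difference, rejected) tuples.
    """
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    n_group = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    results = []
    for i in range(len(responses[0])):
        pct_upper = 100.0 * sum(row[i] for row in upper) / n_group
        pct_lower = 100.0 * sum(row[i] for row in lower) / n_group
        diff = pct_upper - pct_lower
        # Items that fail to separate the two groups are rejected.
        results.append((i + 1, diff, diff <= reject_at_or_below))
    return results
```

An item passed by nearly every pupil shows a difference near zero and is rejected by this index, however well it may fit a scale.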
TABLE 19

ITEMS DESIGNATED AS POOR BY TWO ITEM ANALYSIS PROCEDURES:
THE SCALE PROCEDURE AND UPPER-LOWER 27 PERCENT

Subtraction Item Numbers
   Scale          6, 9
   Upper-Lower    1, 2, 3

Addition Item Numbers
   Scale          None
   Upper-Lower    1, 2, 3, 4, 5, 6

Numeration Item Numbers
   Scale          11, 12, 13, 14
   Upper-Lower    1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Money Item Numbers
   Scale          6, 7, 8, 9, 10
   Upper-Lower    1, 2, 3, 4

Time Item Numbers
   Scale          1, 2, 3, 5
   Upper-Lower    4, 5, 6, 7, ?, 10, 11, 12, 13, 14, 15, 16



These results indicated that the items rejected by
the two procedures were almost completely different. In each test
the upper-lower method rejected the first several items. These
items were easy items for most of the pupils
tested. Their behavioral objectives appeared early in the curriculum
sequence. While the upper-lower item analysis rejected
these easy items because they did not discriminate among the
high and low scorers, the scaled test method did not reject
these items because they were highly reproducible and
provided an assessment of the pupils' mastery at the lower
levels. Items chosen for a scale start with the very
easy items and progress to the most difficult. Each of the
items selected by the scaled test item analysis should
best constitute a scale. Each of the items chosen by the
upper-lower twenty-seven percent item analysis should
discriminate between high and low scorers. The type of item
which would be rejected by both procedures would be the
item answered correctly by low scorers and incorrectly by
high scorers.

2. The second comparison of the scaled tests and the
Metropolitan tests was an attempt to apply the scaling
methodology in order to derive scaled tests from the items
of the Metropolitan arithmetic subtests. All the items
of the arithmetic subtests were classified according to
their respective behavioral objectives. In some cases
more than one item tested a particular objective. When
this occurred, contrived items were constructed, and the
criterion of two-thirds correct for passing was employed.
The items were arranged into tests corresponding to the
five areas covered by the scaled tests. No items were
included which did not represent one of the five areas.
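
The construction of a contrived item can be illustrated with a short sketch. It assumes that "two-thirds correct" means at least two thirds of the items testing the objective were answered correctly; the function name is hypothetical.

```python
from fractions import Fraction


def contrived_item(item_scores, criterion=Fraction(2, 3)):
    """Collapse several 0/1 item scores for one objective into a single score.

    Returns 1 (pass) when the share of correct items is at least the
    criterion, otherwise 0. Exact fractions avoid rounding at the boundary.
    """
    if not item_scores:
        raise ValueError("an objective needs at least one item")
    share = Fraction(sum(item_scores), len(item_scores))
    return 1 if share >= criterion else 0
```

Under this reading a pupil passing two of three items for an objective is credited with the contrived item, while a pupil passing one of two is not.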

The Primary I Metropolitan test yielded only one
test, Numeration, which had sufficient items (eight) to
apply the scale criteria. Guttman had suggested a minimum
of ten items, but the criterion was extended to see what
results would be obtained. Of the remaining four tests from
the Primary I battery, Subtraction, Addition, and Money
had three items each. Time was tested with one item.

The Primary II battery yielded three tests with
sufficient items to attempt the scale criteria: Addition,
seven items; Subtraction, eight items; Numeration, eight
items. Money was tested with four items and Time with
three, both insufficient. The Elementary battery yielded
two tests with sufficient items: Addition, fourteen;
Subtraction, nine. Numeration had three items; Money, five;
and Time, one. The resultant coefficients are shown in
Table 20. Only revised coefficients were calculated.
The number in parentheses after the test name indicates the
number of scores employed in the calculations.












TABLE 20

REPRODUCIBILITY AND SCALABILITY OF TESTS DERIVED FROM
THE METROPOLITAN ACHIEVEMENT TESTS

                                  Revised            Revised
Test                              Reproducibility    Scalability

Primary I Numeration (17)         .934               .550
Primary II Addition (20)          .986               .780
Primary II Subtraction (27)       .921               .605
Primary II Numeration (11)        .920               .364
Elementary Addition (27)          .926               .636
Elementary Subtraction (29)       .958               .744



Only two of the tests met the criteria for scalability,
Primary II Addition and Elementary Subtraction, but neither
would have been subjected to the scaling criteria had the
minimum number of items requirement not been extended. The
only test for the areas investigated which had the minimum
number of items was the Elementary Addition test, but it
failed to meet the scale criteria. One other test, which was
not included in the five areas, had sufficient items for
scaling. This was a multiplication test in the Elementary
battery. There were twelve items for this test but the
scalability coefficient was too low, .629. The reproducibility
coefficient was acceptable at .925.

3. The third comparison between the two types of
tests concerned the number of score patterns obtained for a
given test score in order to determine if a particular raw
score represented the same items being answered correctly.
The initial phase of this comparison involved the raw scores
of the scaled tests and the raw scores of the tests derived
from the Metropolitan batteries. For each test the ratio
of the number of score patterns per test was obtained. This
ratio employed one of the following two denominators, whichever
was the minimum: (1) the total number of subjects taking
the test or (2) the maximum total number of possible score
patterns. Perfect scores, zero scores, and scores obtained
by only one individual were omitted from the analysis because
only one score pattern was possible for each. Since the
fewest number of patterns should be present to best interpret
the items represented by a score, the ratio of patterns to
test should be as small as possible. The ratios are presented
in Table 21. For Time only the Metropolitan II test was
included; the Primary I and Elementary Metropolitan tests had
only one item each for time. The table includes only the
maximum values employed. When total subjects is the smaller
it is presented; when total patterns is the smaller only that
value is presented.
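
The patterns-per-test ratio can be sketched as below. Two readings are assumed here rather than taken from the text: the count of subjects includes only pupils whose scores enter the analysis, and the maximum number of possible patterns is the sum, over the raw scores that enter the analysis, of the number of ways of choosing that many items from the test. The function name is hypothetical.

```python
from collections import defaultdict
from math import comb


def patterns_per_test(responses):
    """Ratio of observed score patterns to the smaller of two denominators.

    `responses` holds one row of 0/1 item scores per pupil. Perfect
    scores, zero scores, and scores earned by a single pupil are omitted.
    """
    n_items = len(responses[0])
    by_score = defaultdict(list)
    for row in responses:
        by_score[sum(row)].append(tuple(row))

    subjects = observed = possible = 0
    for score, rows in by_score.items():
        if score in (0, n_items) or len(rows) < 2:
            continue  # only one pattern is possible for these scores
        subjects += len(rows)
        observed += len(set(rows))         # distinct patterns for this score
        possible += comb(n_items, score)   # patterns that could occur

    denominator = min(subjects, possible)
    ratio = observed / denominator if denominator else None
    return {"subjects": subjects, "max_patterns": possible,
            "patterns": observed, "ratio": ratio}
```

A ratio near zero means that pupils with the same raw score passed largely the same items, which is the property a scaled test is meant to have.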
TABLE 21

ANALYSIS OF NUMBER OF RAW SCORE PATTERNS PER TEST

                                  Total
                     Total        Maximum      Total
Test                 Subjects     Patterns     Patterns    Patterns/Test

Scale Subt.          ?                         30           .39
Scale Add.           71                        26           .37
Scale Num.           53                        27           .51
Scale Money          78                        32           .41
Scale Time           73                        62           .85
Metro. I Subt.       19                        13           .47
Metro. II Subt.      24                        10           .92
Metro. III Subt.     28                        22           .64
Metro. I Add.        17                        13           .76
Metro. II Add.       19                        10           .53
Metro. III Add.      22                        22          1.00
Metro. I Num.        19                        11           .58
Metro. II Num.       6                         3            .50
Metro. III Num.      24                        11           .46
Metro. I Money                    14           9            .64
Metro. II Money      21                        9            .43
Metro. III Money     ?                         ?            ?
Metro. II Time                    14           ?            .43




The results appeared inconsistent at first. The
patterns per test ratios for the scaled Addition and
Subtraction tests were all lower and, therefore, superior to
the corresponding Metropolitan tests. The Numeration tests
were relatively the same; the scaled Money test was superior
to the Metropolitan I test, about equal to the Metropolitan
II test, and inferior to the Metropolitan III test. The
scaled Time test was inferior to the Metropolitan II Time
test. It should be remembered, however, that only the
Addition and Subtraction scaled tests had good scalability
and reproducibility coefficients for Oakleaf. The Time and
Money tests barely met the criteria and the Numeration test
did not meet the criteria. Therefore, only the Addition and
Subtraction tests were good examples of scaled tests in the
present situation.

The second phase in the comparison of score patterns
concerned the scaled tests and the tests which were derived
to be scaled from the Metropolitan batteries. While only
two of the Metropolitan tests met the criteria for scaling
and many did not have enough items to even attempt the
calculation of the criteria, all were included in this
analysis. The reason was to determine if any of the tests had
improved pattern per test ratios which were comparable to
the scaled tests. The results are presented in Table 22.
Again, only those maximum values employed in the calculation
are presented in the table.
TABLE 22

ANALYSIS OF SCORE PATTERNS PER TEST FOR SCALES DERIVED
FROM THE METROPOLITAN TESTS

                                  Total
                     Total        Maximum      Total
Test                 People       Patterns     Patterns    Patterns/Test

Metro. I Subt.                    6            2            .33
Metro. II Subt.      29                        22           .76
Metro. III Subt.     27                        12           .44
Metro. I Add.                     3            1            .33
Metro. II Add.       19                        4            .21
Metro. III Add.      24                        23           .96
Metro. I Num.        15                        5            .33
Metro. II Num.       9                         5            .56
Metro. III Num.                   6            5            .83
Metro. I Money                    6            4            .67
Metro. II Money                   14           7            .50
Metro. III Money     26                        3            .12
Metro. II Time                    6            4            .67



Comparing the results of Table 22 with those of
Table 21, the scaling methodology resulted in improved patterns
per test ratios for the derived Metropolitan Addition and
Subtraction tests, and in some cases the ratios were
superior to those of the scaled tests. The remaining results
were inconsistent, improving in some tests, and showing
greater patterns per test ratios in other cases.

4. The final comparison between the scaled tests and
the Metropolitan tests involved the validity of the tests
in terms of how accurately they predicted the pupils' positions
in the five units of the Oakleaf curriculum sequence. The
predictions followed the same criteria established previously
for the scaled tests, and were based on raw scores. As a
result, no contrived items were present for the Metropolitan
tests. The results were compared by grade, and only those
pupils who had taken both tests were included. The results,
presented in Table 23, indicate the percentages of perfect
predictions and predictions one, two, three, and more than
three skills off. The number in parentheses after the
test name is the number of predictions employed in the
calculations.

The results appeared inconsistent. The scaled tests
made better predictions in some instances whereas the
Metropolitan tests were better in others. To attempt an
explanation for these inconsistencies, the number of
different items employed to test the various levels in each
unit was determined. As discussed in the section devoted
to the validity of the scaled tests, a greater variety of
items covering a range of behavioral objectives should
enhance the predictions of pupil position in the curriculum
sequence. The same was suggested as a reason for the
differences between the scaled and Metropolitan tests.
As a result it was necessary to obtain the number of items
testing each level of the five units. The scaled tests
covered all three grades, but each of the Metropolitan
batteries only covered one grade. Therefore, in order to
compare the two types of tests the range of levels in which
the pupils of each grade were working was also obtained.
Only the items included within the range of levels for a
particular grade were counted. The number of items testing
each level of the five units is presented in Table 24. The
braces above the levels represent the ranges of level for
each grade.

TABLE 24

NUMBER OF ITEMS TESTING EACH LEVEL OF THE FIVE
SELECTED UNITS IN THE OAKLEAF CURRICULUM:
SCALED TESTS VERSUS METROPOLITAN TESTS

In considering Tables 23 and 24 together, only eight
of the fifteen comparisons could be explained by tests having
a greater variety of items to cover a range of objectives.
The remaining seven comparisons could not be explained with
the above reasoning. Closer inspection of the data yielded
several explanations.

For first grade Subtraction the scaled test had
eleven items to the Metropolitan's three, yet the two tests
were about equal in predicting the pupils within three skills
of their actual unit and skill. The Metropolitan test,
however, predicted sixty-three percent of the pupils within
one skill while the scaled test only predicted forty percent.
There were seven items in the scaled test which were at
levels C and D. The last objective tested by the Metropolitan
test was the last objective in level B. The majority of the
first grade students (nineteen) did not work at all in the
Subtraction unit, but were placed at the beginning of level
C for 1965. These pupils, however, were able to answer items
in the scaled test which pertained to a higher level, even
though they had not reached that level in the curriculum
sequence. Because of this the pupils were predicted above
where they were placed. Such occurred in twelve cases. No
opportunity to pass such items was present in the Metropolitan
test. Therefore, all those passing the three items were
predicted at the first skill of level C.

The same line of reasoning was suggested as an
explanation for the results of the second grade Addition
tests, the first grade Money tests, and the first and second
grade Time tests.

The absence of items at level E for the scaled
Numeration test, third grade, was suggested as a reason for
this test having a greater number of predictions more than
three skills off when compared to the Metropolitan test.
Only three of the third grade pupils were working in level
D. All others were working above that level. While only
three items constituted the Metropolitan Numeration test, all
were at level E.

One suggested reason for the difference between the
third grade Addition tests was that the items at level E on
the scaled test were easier items than those at level E on
the Metropolitan. The level E Metropolitan items included both
the addition of four-digit numerals with carrying and
statement problems involving carrying of multiple-digit numerals.
Neither of these objectives was tested on the scaled tests.
Some credence was given this suggestion because twenty
level E predictions were made from the scaled test and
eighteen of these were too high. Only nine level E predictions
were made from the Metropolitan with six being too high.



VI. CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH



From the results it can be concluded that it is
indeed possible to construct sequentially scaled
achievement tests in certain areas of arithmetic. No conclusions
were reached concerning the Time tests, since further
revision was necessary before they would scale. Several other
tests, while scaled, require improvement. These results
pertain to five selected areas in arithmetic covering grades
one through three; thus the conclusions are restricted to
these areas. Further investigation at all grade levels in
arithmetic achievement areas such as multiplication,
division, and fractions is warranted. The nature of arithmetic
lends itself to scaling, since many processes depend on
previously learned skills. The scaling procedures should also
be attempted in other subject matter areas to determine the
scope of application.

The application of any particular scaled test may
be limited, however. The results of the Numeration test
showed that a test may scale for one group and not another.
The rank difference correlations between the tests for the
Oakleaf and Sickman subjects indicated fluctuation in the
item orders between groups. This result meant that a given
order of items may scale for one group but need rearrangement
to scale for another group. Therefore, when describing a
test as scaled the statement should pertain to a specific
group and order of items. A reason suggested for the variation
in item orders is that two different types of schools were
employed in the present investigation and the orders in
which the behavioral objectives were taught may not have
been the same. This suggested that while the scaling
methodology may have a wide range of application, any specific
scaled test may have a rather restricted range of application
dependent on the order of the objectives in the curriculum.
It would be hazardous, therefore, to attempt to infer
behavior from total score for a group whose curriculum
sequence differed from the sequence on which the scaled
test was based.

These results also substantiated the warnings of
Schuessler, Torgerson, Campbell, and Kerckhoff (see page 26)
who stated that one could not conclude, on the basis of a
sample, that a universe was scaled. In the present
investigation the same items had a different scaled order for
two different samples. This suggests that certain skills
in mathematics may not be prerequisite to others but,
rather, depend on the order in which they are taught. The
scaling procedure, therefore, may have application in the
determination of prerequisite skills. Such application
should be the topic of future research.

The final orders of items which obtained the maximum
criteria for scales in the Oakleaf and Sickman samples,
respectively, were not the same orders as logically postulated
for the objectives. This suggested that empirical
verification through the scaling methodology should be attempted
before an order of objectives and their corresponding items
are considered scaled.

The resultant item orders also indicated that the 
objectives in the Oakleaf curriculum were not in the best 
sequence. For example, in both the Oakleaf Subtraction and 
Addition tests the objectives and items involving borrowing 
or carrying with single or multiple-digit numerals were more 
difficult than those which involved multiple-digit numerals 
without either borrowing or carrying. In each instance, 
however, borrowing and carrying with single-digit numerals 
was taught at the level before multiple-digit numerals 
without borrowing or carrying. Such examples suggested that 
the use of the scaling methodology may have application in 
curriculum analysis. Such an application should be the topic 
of future research. 

The scaled tests were constructed as one method of
obtaining greater meaning from test raw scores, namely, to
be able to infer behavior from raw score. The scaled test
raw scores should indicate what specific behaviors a pupil
has mastered on the test. Excluding the results of the Time
test which was not scaled, such inferences were possible
from the tests obtained in the present investigation. In
each test from eighty-seven to ninety-seven percent of the
total raw scores were within one item of representing the
first n items passed.

These results concerning the items represented by
the total score were in contrast to the results from the
total raw scores of a norm-referenced test, the Metropolitan
Achievement Test. Except to state that so many items were
passed and failed, no meaning could be given to the total
raw scores of the two arithmetic subtests of the Metropolitan
at each of the first three grades. These total raw scores
included items from Addition, Subtraction, Multiplication,
Division, Fractions, Money, Time, etc., but there was no
way of telling from the total score which of these items were
passed and which were failed. Even with the Metropolitan
items identified according to units, the scaled tests had
more stable score patterns. This latter result should be
further substantiated, however, for it was based on only two
scaled tests, Addition and Subtraction.

The items which appeared in the scaled tests and in
the Metropolitan tests were similar in many cases. When
the scaling methodology was applied to the unit tests derived
from the Metropolitan, two scaled tests were obtained. The
apparent reason for more scaled tests not being obtained
was lack of items rather than differences in items. The
essential point was that scaled tests could be obtained from
a norm-referenced test. This suggested that the two types
of tests, scaled and norm-referenced, differed mainly because
of the nature of the information desired from the raw score.

The results of application of two item analysis
procedures to the scaled tests suggested that the upper-lower
twenty-seven percent method was inappropriate for scaled tests
since items rejected by this procedure were good items in the
scale. The results also further substantiated the
contention that the type of item analysis applied should
depend on the type of test desired. Further research is
needed in this area with the investigation covering a
wider range of item analysis procedures and varying types
of tests.
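
The upper-lower twenty-seven percent method is commonly computed as a discrimination index: the proportion passing an item in the top-scoring twenty-seven percent of pupils minus the proportion passing it in the bottom twenty-seven percent. The sketch below assumes that common form. The function name, the rounding of the group size, and the data are hypothetical, and the rejection criterion actually used in the investigation is not restated here.

```python
from typing import List, Sequence


def discrimination_indices(
    scores: Sequence[Sequence[int]], fraction: float = 0.27
) -> List[float]:
    """Upper-lower discrimination index for each item.

    `scores` holds one row per pupil, with 1 (pass) or 0 (fail) per item.
    Pupils are ranked by total score; the top and bottom `fraction`
    of pupils form the upper and lower groups.
    """
    if not scores:
        raise ValueError("no score rows supplied")
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # assumed rounding rule
    upper, lower = ranked[:k], ranked[-k:]
    n_items = len(ranked[0])
    return [
        sum(p[i] for p in upper) / k - sum(p[i] for p in lower) / k
        for i in range(n_items)
    ]


# Hypothetical data: ten pupils, four items in a cumulative order.
scores = [
    [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 0],
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
    [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
]
indices = discrimination_indices(scores)
print(indices)
```

In this made-up data the easiest item is passed by every pupil, so its index is zero even though it sits cleanly at the bottom of a cumulative order.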


The split-half reliability coefficients obtained for 
the scaled tests were of the same magnitude as those of the 
Metropolitan subtests even though the procedure was more 
suited to the Metropolitan. Because of the contaminating 
factors the scaled tests may well have higher split-half 
reliability than the Metropolitan arithmetic subtests.
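
As an illustration of the kind of coefficient discussed above, the following sketch computes a split-half reliability using an odd-even split of items and the Spearman-Brown correction. Both choices are assumptions; the text does not state how the halves were formed or whether a correction was applied, and the data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a half-test has no score variance")
    return sxy / sqrt(sxx * syy)


def split_half_reliability(scores: Sequence[Sequence[int]]) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction."""
    odd_half = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r = pearson(odd_half, even_half)
    return 2 * r / (1 + r)  # Spearman-Brown step-up to full length


# Hypothetical data: five pupils, four items, perfect cumulative patterns.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))  # 0.88
```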

Further research is suggested in the area of test-retest
reliability for scaled tests. One test-retest
procedure which could be attempted would be to administer
the scaled test, wait a day, and re-administer the test.
This should practically eliminate the contaminating factor
of having the ranks of students change. If extended periods
of time are allowed to pass between administrations, a
procedure might be developed which would encompass only the
items passed on the initial test administration. That is,
if a pupil passed the first n items the initial time he
should have passed at least the first n items the second
time.
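
The check described above, that a pupil who passed the first n items initially should pass at least the first n items on the second administration, might be sketched as follows. The function names and data are hypothetical, and counting only the unbroken run of passes from the first item is one assumed reading of "the first n items."

```python
from typing import Sequence, Tuple


def leading_passes(responses: Sequence[int]) -> int:
    """Length of the unbroken run of passes starting at the first item."""
    count = 0
    for r in responses:
        if r != 1:
            break
        count += 1
    return count


def retained(first: Sequence[int], second: Sequence[int]) -> bool:
    """True if the retest run of passes is at least as long as the initial run."""
    return leading_passes(second) >= leading_passes(first)


def retention_rate(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Fraction of pupils whose initial run of passes was retained."""
    if not pairs:
        raise ValueError("no test-retest pairs supplied")
    kept = sum(1 for first, second in pairs if retained(first, second))
    return kept / len(pairs)


# Hypothetical (initial, retest) response pairs in scaled item order.
pairs = [
    ([1, 1, 0, 0], [1, 1, 1, 0]),  # run grew from 2 to 3: retained
    ([1, 1, 1, 0], [1, 1, 0, 1]),  # run fell from 3 to 2: not retained
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # run of 0 both times: retained
]
print(retention_rate(pairs))
```

A measure of this kind ignores gains made between administrations, which is the property wanted when extended periods of instruction separate the two testings.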

The validity procedure which involved predictions
of pupil position in the curriculum sequence is suggested
as providing a type of construct validity. The results
suggested that this type of validity could be increased by
greater coverage of the behavioral objectives in a curriculum
sequence. While this conclusion should have been obvious,
the results afforded empirical evidence. The results also
suggested certain contaminating factors for this type of
validation: (1) If the objectives are not in the same
sequence in the curriculum as they are in the test, the
resultant predictions may be less accurate than if the
orders were the same. (2) Two of the factors appeared
to interact: the number of items and the number of objectives
tested. It is suggested that this type of validity will be
improved with more objectives in a given level being assessed
with a sufficient number of items. Further research is
needed, however, to determine the amount of interaction of
the above factors and to determine how many items constitute
a sufficient number to test an objective.

To employ the above validity procedure, however, a
daily record of pupil achievement and progress in the
curriculum sequence is required. Future research in this area
could involve the evaluation of this validity procedure as
a type of construct validity, and also the evaluation, employing
this validity procedure, of tests currently being employed in
the schools as measures of achievement.



In addition to consideration of validity, future
investigation should employ caution in the construction of
scaled tests. When constructing scaled tests it should
be made certain that all groups to be tested follow the
same curriculum sequence, or that differences are adjusted
for by scaling for each group separately.

If behavior is to be inferred from test score, both
an individual's responses and the order of items should
be consistent. Therefore, future investigations should
consider both types of reliability, stability of test scores
and stability of item orders, when a scaled test is evaluated.
Caution should also be exercised in the selection of item
analysis procedures since the type of test desired may well
dictate the item analysis procedure to be employed.

It is also suggested that the scaling methodology
has application in the schools for use by teachers and
curriculum designers. The methodology is not complicated,
and therefore should be readily accessible to the classroom
teacher. Given the objectives, a teacher who can write good
test items should find the methodology useful for placement,
diagnostic, and achievement testing.

From the results of the present investigation
those concerned with curriculum design should also find
application for the scaling methodology. The methodology may
be employed as a device for analyzing curriculum sequences or
for determining certain prerequisite skills.

APPENDIX B

MONEY OBJECTIVES

The student will be able to: 

1. Identify pennies, nickels, and dimes. 

2. Give the value of a penny, nickel, or dime in cents. 

3. Identify the dollar and cent signs. 

4. Give the value of a combination of two coins (penny, 
nickel, and dime). 

5. Give the value of a combination of 3 or more coins 
(penny, nickel, dime). 

6. Identify quarters, half-dollars, and dollars. 

7. Identify equivalent amounts of money. 

8. Give the number of pennies, nickels, dimes, quarters, 
and half-dollars in a dollar. 

9. Give the value of a sum of money expressed in decimal
notation.

10. Add two amounts of money expressed in decimal notation
without carrying.

11. Add two amounts of money expressed in decimal notation
with carrying.

12. Make change for amounts of money of $5.00 or less.

ADDITION OBJECTIVES 
The student will be able to: 

1. Count the number of objects in a set (less than 10).

2. Add two single-digit numerals with sums less than 10.

a. horizontally arranged

b. vertically arranged

(Where not specified the numerals are arranged
vertically)

3. Add two single-digit numerals with sums greater than
or equal to 10.

a. horizontally arranged

b. vertically arranged

4. Add three single-digit numerals.

5. Add two two-digit numerals without carrying.

6. Add three two-digit numerals without carrying. 

7. Identify the proper way to arrange numerals so that they 
could be added. 

8. Add two three-digit numerals without carrying. 

9. Add three three-digit numerals without carrying. 

10. Add two two-digit numerals with carrying. 

11. Add two three-digit numerals with carrying. 

12. Add three two-digit numerals with carrying. 

SUBTRACTION OBJECTIVES 
The student will be able to:

1. Subtract one group of objects from another, 10 objects
or less.

2. Subtract using the expression "take away," with
numerals less than 10.

3. Subtract single-digit numerals.

a. horizontally arranged

b. vertically arranged

(Where not specified the numerals are arranged
vertically)

4. Subtract a single-digit numeral from a two-digit numeral
(less than 20) without borrowing.

5. Identify the proper way to arrange numerals so that they
can be subtracted.

6. Subtract a single-digit numeral from a two-digit numeral 
(less than 20) with borrowing. 

7. Subtract two two-digit numerals without borrowing. 

8. Subtract a two-digit numeral from a three-digit numeral 
without borrowing. 

9. Subtract two three-digit numerals without borrowing. 

10. Subtract two two-digit numerals with borrowing. 






TIME TELLING OBJECTIVES 
The student will be able to: 

1. Fill in the missing numerals on a clock face.

2. Identify the hour and minute hands.

3. Tell time to the hour.

4. Tell time to the half-hour.

5. Tell time to the quarter-hour.

6. Count the number of minutes between two points on a
clock.

a. up to 30 minutes

b. from 30 minutes to 1 hour

7. Discriminate between clock faces when the time is expressed

a. [illegible] minutes after [illegible] (half-hour)

b. [illegible] thirty (half-hour)

8. Write the time in numerals (e.g. 8:00) when given a time.
9. Write the time in numerals when given a clock face:

a. to the hour

b. to the half-hour

c. to the quarter-hour

d. to five-minute intervals

e. to the minute






NUMERATION OBJECTIVES 

The student will be able to: 

1. Recognize numerals from 1 to 10.

2. Write numerals sequentially from 1 - 10.

3. Find the larger or smaller of two numerals from 1 - 10.

4. Write the numeral that comes just after a numeral from 1 - 10.

5. Write the numeral that comes just before a numeral from 1 - 10.

6. Recognize numerals from 1 - 100.

7. Write numerals sequentially that come between two given 
numerals from 1 - 100. 

8. Find the larger or smaller of two numerals from 1 - 100. 

9. Count the number of objects in a group presented visually. 

10. Count by fives from 1 - 100.

11. Write the numerals that come before and after a given 
number or series of numbers from 1 - 100. 

12. Write sequentially, counting by ones, the numerals that
follow a given numeral from 100 - 1000.

13. Write the numeral that comes just after a given numeral
from 100 - 1000.

14. Write the numeral that comes just before a given numeral
from 100 - 1000.

APPENDIX C

A SHORT DESCRIPTION OF MATHEMATICS UNITS

1. A Numeration - Counting to ten.

2. A Addition - Addition to sums of six with pictured objects.

3. A Fractions - Identification of 1/2 of objects and small
sets.

4. A Money - Recognition of common coins (penny, nickel, dime).

5. A Time - The day as a unit of time.

6. A Systems of Measurement - Qualitative dimensional
discrimination by verbal directions.

7. A Geometry - Recognition of simple geometric figures.

8. B Numeration - Counting to 100. Use of ordinals to 10th.

9. B Addition - Addition to sums of 10.

10. B Money - Beginning money equivalents (5¢ = 1 nickel).

11. B Time - Clock reading to the hour.

12. B Systems of Measurement - Beginning equivalent length
(3 ft. = 1 yd.).

13. B Geometry - Draws simple geometric figures.

14. C Numeration - Counting to 150.

15. C Place Value - Place value charting to hundreds.

16. C Addition - Two digit sums without carrying but with
expanded notation.

17. C Subtraction - Two digit differences without carrying but
with expanded notation.

18. C Combination of Processes - Word problems with skills
learned to this point plus selection of proper operation
to solve problems.

19. C Fractions - With fractions to 1/4, divides single objects
and groups of objects.

20. C Money - [illegible] of penny, nickel, dime, and
quarter.

21. C Time - Solves problems requiring addition or subtraction
of [illegible].

22. C Systems of Measurement - Converts units: inches - feet,
pint - quart - cup, dozen - 1/2 dozen.

23. C Geometry - Recognizes and names solid geometric figures.

24. C Special Topics - Reads Roman numerals [illegible]
thermometer; reads charts and graphs.

25. D Numeration - Counting to 1,000 (reading and writing
numerals with skip counting).

26. D Place Value - Makes and reads place value charts to
thousands.

27. D Addition - Begins addition with carrying.

28. D Subtraction - Begins subtraction with borrowing.

29. D Multiplication - Does multiplication as repeated addition.
Memorizes tables through 5 x 5.

30. D Division - Does division as partition, inverse to
addition, and memorizes tables through 25
divided by 5.

31. D Combination of Processes - Solves problems requiring
selection and discrimination of many processes.

32. D Fractions - Applies fractional concepts (2/3, 3/4) to objects
and groups. Begins operations (1/2 x 8 = ?).

33. D Money - Operates with money values to $5.00.

34. D Time - Tells time to the minute and uses time in problems.

35. D Systems of Measurement - Extends linear and volume systems
and begins metric system with centimeters.

36. D Geometry - Identifies open versus closed curves, line
segments versus lines.

37. D Special Topics - Reads Roman numerals to 30.

38. E Numeration - [illegible] versus even numbers; rounds
and estimates numbers.

39. E Place Value - Uses place value to millions; begins
exponents of base 10.

40. E Addition - Performs [illegible] to thousands.

41. E Subtraction - [illegible] to hundreds.

42. E Multiplication - Does multiplication as repeated addition.
[illegible] distributive principle [illegible].

43. E Division - Uses [illegible] division.

44. E Combination of Processes - [illegible] as variable.
Does operations with competing [illegible].

45. E Fractions - Identifies equivalent [illegible]; adds
fractions with a common denominator.

46. E Money - Adds and subtracts money values using decimal
notation.

47. E Time - Uses seconds in time problems.

48. E Systems of Measurement - Adds and subtracts measures
by regrouping when necessary.

49. E Geometry - Identifies simple line figures (equilateral
triangle, [illegible] parallel lines, [illegible] right angle,
intersecting lines, [illegible] lines).

50. E Special Topics - [illegible]

APPENDIX D

SCALOGRAM OF OAKLEAF SUBTRACTION TEST